Abstract

Given a collection of $r \geq 2$ linear regression problems in $p$ dimensions, suppose that
the regression coefficients share partially common supports. This set-up suggests the
use of $\ell_1/\ell_\infty$-regularized regression for joint estimation of the $p \times r$ matrix of
regression coefficients. We analyze the high-dimensional scaling of $\ell_1/\ell_\infty$-regularized
quadratic programming, considering both consistency rates in $\ell_\infty$-norm, and also how the
minimal sample size $n$ required for performing variable selection grows as a function of the
model dimension, sparsity, and overlap between the supports. We begin by establishing
bounds on the $\ell_\infty$-error, as well as sufficient conditions for exact variable selection, for
fixed design matrices and for designs drawn randomly from general Gaussian matrices.
Our second set of results applies to $r = 2$ linear regression problems with standard
Gaussian designs whose supports overlap in a fraction $\alpha \in [0, 1]$ of their entries: for this
problem class, we prove that the $\ell_1/\ell_\infty$-regularized method undergoes a phase transition
(that is, a sharp change from failure to success) characterized by the rescaled sample size
$\theta_{1,\infty}(n, p, s, \alpha) = n/\{(4 - 3\alpha)\, s \log(p - (2 - \alpha)s)\}$. More precisely, given sequences of
problems specified by $(n, p, s, \alpha)$, for any $\delta > 0$, the probability of successfully recovering
both supports converges to 1 if $\theta_{1,\infty}(n, p, s, \alpha) > 1 + \delta$, and converges to 0 for problem
sequences for which $\theta_{1,\infty}(n, p, s, \alpha) < 1 - \delta$. An implication of this threshold is that use
of $\ell_1/\ell_\infty$-regularization yields improved statistical efficiency if the overlap parameter is
large enough ($\alpha > 2/3$), but has worse statistical efficiency than a naive Lasso-based
approach for moderate to small overlap ($\alpha < 2/3$). Empirical simulations illustrate the
close agreement between these theoretical predictions and the actual behavior in practice.
These results indicate that some caution needs to be exercised in the application of $\ell_1/\ell_\infty$
block regularization: if the data does not match its structure closely enough, it can impair
statistical performance relative to computationally less expensive schemes.^1



1 Introduction 

The area of high-dimensional statistical inference is concerned with the behavior of models 
and algorithms in which the dimension p is comparable to, or possibly even larger than the 
sample size n. In the absence of additional structure, it is well-known that many standard 
procedures (among them linear regression and principal component analysis) are not
consistent unless the ratio $p/n$ converges to zero. Since this scaling precludes having $p$
comparable to or larger than $n$, an active line of research is based on imposing structural
conditions on the data (e.g., sparsity, manifold constraints, or graphical model structure),
and studying the high-dimensional consistency (or inconsistency) of various types of estimators.

^1 This work was presented in part at the NIPS 2008 conference in Vancouver, Canada, December 2008.

This paper deals with high-dimensional scaling in the context of solving multiple regression 
problems, where the regression vectors are assumed to have shared sparse structure. More 
specifically, suppose that we are given a collection of $r$ different linear regression models in $p$
dimensions, with regression vectors $\beta^i \in \mathbb{R}^p$, for $i = 1, \ldots, r$. We let
$S(\beta^i) = \{j \mid \beta^i_j \neq 0\}$ denote the support set of $\beta^i$. In many applications (among them
sparse approximation, graphical model selection, and image reconstruction) it is natural to impose
a sparsity constraint, corresponding to restricting the cardinality $|S(\beta^i)|$ of each support set.
Moreover, one might expect some amount of overlap between the sets $S(\beta^i)$ and $S(\beta^j)$ for
indices $i \neq j$, since they correspond to the sets of active regression coefficients in each
problem. Let us consider some examples to illustrate:

• Consider the problem of image denoising or compression, say using a wavelet transform 
or some other type of multiresolution basis [17]. It is well known that natural images
tend to have sparse representations in such bases [27] . Moreover, similar images — say 
the same scene taken from multiple cameras — would be expected to share a similar 
subset of active features in the reconstruction. Consequently, one might expect that 
using a block-regularizer that enforces such joint sparsity could lead to improved image 
denoising or compression. 

• Consider the problem of identifying the structure of a Markov network or graphical 
model [10] based on a collection of samples (e.g., observations of a social network).
For networks with a single parameter per edge (e.g., Gaussian models [19], Ising
models [25]), a line of recent work has shown that $\ell_1$-based methods can be successful
in recovering the network structure. However, many graphical models have multiple 
parameters per edge (e.g., for discrete models with non-binary state spaces), and it is 
natural that the subset of parameters associated with a given edge are zero (or non- 
zero) in a grouped manner. Thus, any method for recovering the graph structure should 
impose a block-structured regularization that groups together the subset of parameters 
associated with a single edge. 

• Finally, consider a standard problem in genetic analysis: given a set of gene expression 
arrays, where each array corresponds to a different patient but the same underlying 
tissue type (e.g., tumor), the goal is to discover the subset of features relevant for 
tumorous growths. This problem can be expressed as a joint regression problem, again 
with a shared sparsity constraint coupling together the different patients. In this context, 
the recent work of Liu et al. [13] shows that imposing additional structural constraints
can be beneficial (e.g., they are able to greatly reduce the number of expressed genes 
while maintaining the same prediction performance). 

Given these structural conditions of shared sparsity in these and other applications, it is 
reasonable to consider how this common structure can be exploited so as to increase the 
statistical efficiency of estimation procedures. 

There is now a substantial and relatively mature body of work on $\ell_1$-regularization for
estimation of sparse models, dating back to the introduction of the Lasso and basis pursuit
[28, 5]. With contributions from various researchers (e.g., [7, 19, 29, 37, 33]), there is now a
fairly complete theory of the behavior of the Lasso for high-dimensional sparse estimation. A more
recent line of work (e.g., [31, 35, 22, 30, 36]), motivated by applications in which block or
hierarchical structure arises, has proposed the use of block $\ell_a/\ell_b$ norms for various
$a, b \in [1, \infty]$. Of particular relevance to this paper is the block $\ell_1/\ell_\infty$ norm, proposed
initially by Turlach et al. [31] and Tropp et al. [30]. This form of block regularization is a
special case of the more general family of composite or hierarchical penalties, as studied by
Zhao et al. [36].

Various authors have empirically demonstrated that block regularization schemes can yield
better performance for different data sets [36, 22, 14]. Some recent work by Bach [1] has
provided consistency results for $\ell_1/\ell_2$ block-regularization schemes under classical scaling,
meaning that $n \to +\infty$ with $p$ fixed. Meier et al. [18] have established high-dimensional
consistency for the predictive risk of $\ell_1/\ell_2$ block-regularized logistic regression. The papers
[15, 21, 23] have provided high-dimensional consistency results for $\ell_1/\ell_q$ block regularization
for support recovery using fixed design matrices, but the rates do not provide sharp differences
between the case $q = 1$ and $q > 1$.

To date, there has been a relatively limited amount of theoretical work characterizing if and
when the use of block regularization schemes actually leads to gains in statistical efficiency. As
we elaborate below, this question is significant due to the greater computational cost involved
in solving block-regularized convex programs. In the case of $\ell_1/\ell_2$ regularization, concurrent
work by Obozinski et al. [23] (involving a subset of the current authors) has shown that
the $\ell_1/\ell_2$ method can yield statistical gains up to a factor of $r$, the number of separate
regression problems; more recent concurrent work [9], [16] has provided related high-dimensional
consistency results for $\ell_1/\ell_2$ regularization, emphasizing the gains when the number of tasks
$r$ is much larger than $\log p$.

This paper considers this issue in the context of variable selection using block $\ell_1/\ell_\infty$
regularization. Our main contribution is to obtain some precise (and arguably surprising)
insights into the benefits and dangers of using block $\ell_1/\ell_\infty$ regularization, as compared to
simpler $\ell_1$-regularization (separate Lasso for each regression problem). We begin by providing
a general set of sufficient conditions for consistent support recovery for both fixed design
matrices, and random Gaussian design matrices. In addition to these basic consistency results,
we then seek to characterize rates, for the particular case of standard Gaussian designs, in a
manner precise enough to address the following questions:

(a) First, under what structural assumptions on the data does the use of $\ell_1/\ell_\infty$
block-regularization provide a quantifiable reduction in the scaling of the sample size $n$, as
a function of the problem dimension $p$ and other structural parameters, required for
consistency?

(b) Second, are there any settings in which $\ell_1/\ell_\infty$ block-regularization can be harmful
relative to computationally less expensive procedures?

Answers to these questions yield useful insight into the tradeoff between computational and sta- 
tistical efficiency in high-dimensional inference. Indeed, the convex programs that arise from 
using block-regularization typically require a greater computational cost to solve. Accord- 
ingly, it is important to understand under what conditions this increased computational cost 
guarantees that fewer samples are required for achieving a fixed level of statistical accuracy. 

The analysis of this paper gives conditions on the designs and regression matrix $B$ for which
$\ell_1/\ell_\infty$ yields improvements (question (a)), and also shows that if there is sufficient mismatch
between the regression matrix $B$ and the $\ell_1/\ell_\infty$ norm, then use of this regularizer actually
impairs statistical efficiency relative to a naive $\ell_1$-approach. As a representative instance of
our theory, consider the special case of standard Gaussian design matrices and two regression
problems ($r = 2$), with the supports $S(\beta^1)$ and $S(\beta^2)$ each of size $s$ and overlapping in a
fraction $\alpha \in [0, 1]$ of their entries. For this problem, we prove that block $\ell_1/\ell_\infty$ regularization
undergoes a phase transition (meaning a sharp threshold between failure and success) that
is specified by the rescaled sample size

$$\theta_{1,\infty}(n, p, s, \alpha) := \frac{n}{(4 - 3\alpha)\, s \log(p - (2 - \alpha)s)}. \qquad (1)$$

In words, for any $\delta > 0$ and for scalings of the quadruple $(n, p, s, \alpha)$ such that
$\theta_{1,\infty} > 1 + \delta$, the probability of successfully recovering both $S(\beta^1)$ and $S(\beta^2)$ converges
to one, whereas for scalings such that $\theta_{1,\infty} < 1 - \delta$, the probability of success converges to zero.

Figure 1 illustrates how the theoretical threshold (1) agrees with the behavior observed
in practice. This figure plots the probability of successful recovery using the block $\ell_1/\ell_\infty$
approach versus the rescaled sample size $n/[2s \log(p - (2 - \alpha)s)]$; the results shown here
are for $r = 2$ regression parameters. The plots show twelve curves, corresponding to three
different problem sizes $p \in \{128, 256, 512\}$ and four different values of the overlap parameter
$\alpha \in \{0.1, 0.3, 0.7, 1\}$. First, let us focus on the set of curves labeled with $\alpha = 1$, corresponding
to the case of complete overlap between the regression vectors. Notice how the curves for all three
problem sizes $p$, when plotted versus the rescaled sample size, line up with one another; this
"stacking effect" shows that the rescaled sample size captures the phase transition behavior.
Similarly, for other choices of the overlap, the sets of three curves (over problem size $p$) exhibit
the same stacking behavior. Secondly, note that the results are consistent with the theoretical
prediction (1): the stacks of curves shift to the right as the overlap parameter $\alpha$ decreases
from 1 towards 0, showing that problems with less overlap require a larger rescaled sample
size. More interesting is the sharpness of agreement in quantitative terms: the vertical lines
in the center of each stack show the point at which our theory (1) predicts that the method
should transition from failure to success.

[Figure 1: plot titled "$\ell_1/\ell_\infty$ relaxation for $s = 0.1p$ and $\alpha = 1, 0.7, 0.4, 0.1$"; horizontal axis: control parameter.]

Figure 1. Probability of success in recovering the joint signed supports plotted against the
rescaled sample size $\theta_{\mathrm{Las}} := n/[2s \log(p - (2 - \alpha)s)]$ for linear sparsity $s = 0.1p$. Each stack
of graphs corresponds to a fixed overlap $\alpha$, as labeled on the figure. The three curves within
each stack correspond to problem sizes $p \in \{128, 256, 512\}$; note how they all align with each
other and exhibit step-like behavior, consistent with Theorem 3. The vertical lines correspond
to the thresholds $\theta_{1,\infty}(\alpha)$ predicted by Theorem 3; note the close agreement between theory
and simulation.

By comparison to previous theory on the behavior of the Lasso (ordinary $\ell_1$-regularized
quadratic programming), the scaling (1) has two interesting implications. For the $s$-sparse
regression problem with standard Gaussian designs, the Lasso has been shown [33] to transition
from success to failure as a function of the rescaled sample size

$$\theta_{\mathrm{Las}}(n, p, s) := \frac{n}{2s \log(p - s)}. \qquad (2)$$

In particular, under the conditions imposed here, solving two separate Lasso problems, one
for each regression problem, would recover both supports for problem sequences $(n, p, s)$ such
that $\theta_{\mathrm{Las}} > 1$. Thus, one consequence of our analysis is to characterize the relative statistical
efficiency of $\ell_1/\ell_\infty$ regularization versus ordinary $\ell_1$-regularization, as described by the ratio



$$R(\alpha) := \frac{\theta_{\mathrm{Las}}(n, p, s)}{\theta_{1,\infty}(n, p, s, \alpha)}.$$

Our theory predicts that (disregarding some $o(1)$ factors) the relative efficiency scales as
$R(\alpha) \sim \frac{4 - 3\alpha}{2}$, which (as we show later) shows excellent agreement with empirical behavior
in simulation. Our characterization of $R(\alpha)$ confirms that if the regression matrix $B$ is
well-aligned with the block $\ell_1/\ell_\infty$ regularizer (more specifically, for overlaps $\alpha \in (2/3, 1]$) then
block-regularization increases statistical efficiency. On the other hand, our analysis also conveys a
cautionary message: if the overlap is too small (more precisely, if $\alpha < 2/3$) then block $\ell_1/\ell_\infty$
is actually harmful relative to the naive Lasso-based approach. This fact illustrates that some care
is required in the application of block regularization schemes.
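As a rough numerical companion to the thresholds (1) and (2), the following sketch evaluates both rescaled sample sizes and the leading-order ratio $(4 - 3\alpha)/2$. The function names and the example values of $p$ and $s$ are illustrative assumptions, not part of the original analysis.

```python
import math


def theta_block(n, p, s, alpha):
    """Rescaled sample size (1) for block l1/l_inf regularization with r = 2."""
    return n / ((4.0 - 3.0 * alpha) * s * math.log(p - (2.0 - alpha) * s))


def theta_lasso(n, p, s):
    """Rescaled sample size (2) for the ordinary Lasso."""
    return n / (2.0 * s * math.log(p - s))


def threshold_sample_size(p, s, alpha):
    """Sample size at which theta_block equals 1, i.e. the predicted transition."""
    return (4.0 - 3.0 * alpha) * s * math.log(p - (2.0 - alpha) * s)


def relative_efficiency(alpha):
    """Leading-order value (4 - 3*alpha)/2 of the ratio theta_lasso / theta_block."""
    return (4.0 - 3.0 * alpha) / 2.0


if __name__ == "__main__":
    p, s = 512, 51  # hypothetical problem size with roughly linear sparsity s = 0.1 p
    n_lasso = 2.0 * s * math.log(p - s)
    for alpha in (0.1, 0.3, 0.7, 1.0):
        n_block = threshold_sample_size(p, s, alpha)
        print(alpha, round(n_block), round(n_lasso), relative_efficiency(alpha))
```

Values of `relative_efficiency` below 1 correspond to the regime $\alpha > 2/3$ in which the block method is predicted to need fewer samples than two separate Lasso problems; values above 1 correspond to $\alpha < 2/3$.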

In terms of proof techniques, the analysis of this paper is considerably more delicate than
the analogous arguments required to show support consistency for the Lasso [19, 33, 37].
The major difference, and one that presents substantial technical challenges, is that the
sub-differential^2 of the block $\ell_1/\ell_\infty$ norm is a much more subtle object than the
sub-differential of the ordinary $\ell_1$-norm. In particular, the $\ell_1$-norm has an ordinary derivative
with respect to any non-zero coefficient. In contrast, even for non-zero rows of the regression
matrix, the block $\ell_1/\ell_\infty$ norm may be non-differentiable, and these non-differentiable points play a
key role in our analysis. (See Section 4.1 for more detail on the sub-differential of this block
norm.) As we show, it is the Frobenius norm of the sub-differential on the regression matrix
support that controls high-dimensional scaling. For the ordinary $\ell_1$-norm, this Frobenius
norm is always equal to $\sqrt{s}$, whereas for matrices with $r = 2$ columns and a fraction $\alpha$ of
overlap, this Frobenius norm can be as small as $\sqrt{(4 - 3\alpha)s/2}$. As our analysis reveals, it is
precisely the differing structures of these sub-differentials that lead to different high-dimensional
scaling for $\ell_1$ versus $\ell_1/\ell_\infty$ regularization.
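To indicate where a factor of the form $(4 - 3\alpha)$ can come from, here is a short worked calculation for $r = 2$. It is only a sketch under an illustrative assumption: every row shared by both supports has $|\beta_k^1| = |\beta_k^2|$, so that both entries attain the row maximum. The symbols $z_k$ (a row of a sub-gradient matrix $Z$) and $t$ are notation introduced here for the illustration.

```latex
% Row k in exactly one support: the maximum is attained by a single entry i.
z_k = \mathrm{sign}(\beta_k^i)\, e_i , \qquad \|z_k\|_2^2 = 1 .

% Shared row with |\beta_k^1| = |\beta_k^2|: any convex weighting is a sub-gradient.
z_k = \bigl( t\,\mathrm{sign}(\beta_k^1),\; (1 - t)\,\mathrm{sign}(\beta_k^2) \bigr),
\quad t \in [0, 1], \qquad \min_{t} \|z_k\|_2^2 = \tfrac{1}{2} \ \ (t = \tfrac{1}{2}) .

% With \alpha s shared rows and 2(1 - \alpha)s unshared rows in the union of supports:
\min \|Z_U\|_F^2 \;=\; 2(1 - \alpha)s \cdot 1 \;+\; \alpha s \cdot \tfrac{1}{2}
\;=\; \frac{(4 - 3\alpha)\, s}{2} .
```

For comparison, the sign vector of a single $s$-sparse Lasso problem has squared norm $s$, so that twice the squared norm gives $2s$ for the Lasso and $(4 - 3\alpha)s$ for the block norm, matching the leading factors in the denominators of (2) and (1).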

The remainder of this paper is organized as follows. In Section 2, we provide a precise
description of the problem. Section 3 is devoted to the statement of our main results, some
discussion of their consequences, and illustration by comparison to empirical simulations. In
Section 4, we provide an outline of the proof, with the technical details of many intermediate
lemmas deferred to the appendices.

Notational conventions: For the convenience of the reader, we summarize here some
notation to be used throughout the paper. We reserve the index $i \in \{1, \ldots, r\}$ as a superscript
in indexing the different regression problems, or equivalently the columns of the matrix
$B \in \mathbb{R}^{p \times r}$. Given a design matrix $X \in \mathbb{R}^{n \times p}$ and a subset $S \subseteq \{1, \ldots, p\}$, we use
$X_S$ to denote the $n \times |S|$ sub-matrix obtained by extracting those columns indexed by $S$. For a
pair of matrices $A \in \mathbb{R}^{m \times \ell}$ and $B \in \mathbb{R}^{m \times n}$, we use the notation
$\langle A, B \rangle := A^T B$ for the resulting $\ell \times n$ matrix.

We use the following standard asymptotic notation: for functions $f, g$, the notation
$f(n) = O(g(n))$ means that there exists a fixed constant $0 < C < +\infty$ such that $f(n) \leq C g(n)$;
the notation $f(n) = \Omega(g(n))$ means that $f(n) \geq C g(n)$, and $f(n) = \Theta(g(n))$ means that
$f(n) = O(g(n))$ and $f(n) = \Omega(g(n))$.

^2 As we describe in more detail in Section 4.1, the sub-differential is the appropriate generalization
of gradient to convex functions that are allowed to have "corners", like the $\ell_1$ and $\ell_1/\ell_\infty$
norms; the standard books [26, 8] contain more background on sub-differentials and their properties.

2 Problem set-up 

We begin by setting up the problem to be studied in this paper, including multivariate
regression and the family of block-regularized programs for estimating sparse vectors.

2.1 Multivariate regression and block regularization schemes 

In this paper, we consider the following form of multivariate regression. For each $i = 1, \ldots, r$,
let $\beta^i \in \mathbb{R}^p$ be a regression vector, and consider the $r$-variate linear regression problem

$$y^i = X^i \beta^i + w^i, \qquad i = 1, 2, \ldots, r. \qquad (3)$$

Here each $X^i \in \mathbb{R}^{n \times p}$ is a design matrix, possibly different for each vector $\beta^i$, and
$w^i \in \mathbb{R}^n$ is a noise vector. We assume that the noise vectors $w^i$ and $w^j$ are independent for
different regression problems $i \neq j$. In this paper, we assume that each $w^i$ has a multivariate
Gaussian $N(0, \sigma^2 I_{n \times n})$ distribution. However, we note that qualitatively similar results will
hold for any noise distribution with sub-Gaussian tails (see the book [3] for more background on
sub-Gaussian variates).

For compactness in notation, we frequently use $B$ to denote the $p \times r$ matrix with
$\beta^i \in \mathbb{R}^p$ as the $i$th column. Given a parameter $q \in [1, \infty]$, we define the
$\ell_1/\ell_q$ block-norm as follows:

$$\|B\|_{\ell_1/\ell_q} := \sum_{k=1}^{p} \|(\beta_k^1, \beta_k^2, \ldots, \beta_k^r)\|_q, \qquad (4)$$

corresponding to applying the $\ell_q$ norm to each row of $B$, and the $\ell_1$-norm across all of
these blocks. We note that all of these block norms are special cases of the CAP family of
penalties [36].

This family of block-regularizers (4) suggests a natural family of $M$-estimators for
estimating $B$, based on solving the block-$\ell_1/\ell_q$-regularized quadratic program

$$\widehat{B} \in \arg\min_{B \in \mathbb{R}^{p \times r}} \Big\{ \frac{1}{2n} \sum_{i=1}^{r} \|y^i - X^i \beta^i\|_2^2 + \lambda_n \|B\|_{\ell_1/\ell_q} \Big\}, \qquad (5)$$

where $\lambda_n > 0$ is a user-defined regularization parameter. Note that the data term is separable
across the different regression problems $i = 1, \ldots, r$, due to our assumption of independence
on the noise vectors. Any coupling between the different regression problems is induced by
the block-norm regularization.

In the special case of univariate regression ($r = 1$), the parameter $q$ plays no role, and
the block-regularized scheme (5) reduces to the Lasso [28, 5]. If $q = 1$ and $r \geq 2$, the
block-regularization function (like the data term) is separable across the different regression
problems $i = 1, \ldots, r$, and so the scheme (5) reduces to solving $r$ separate Lasso problems.
For $r \geq 2$ and $q = 2$, the program (5) is frequently referred to as the group Lasso [35, 22].
Another important case [31, 30], and the focus of this paper, is the setting $q = \infty$ and $r \geq 2$,
which we refer to as block $\ell_1/\ell_\infty$ regularization.

The motivation for using block $\ell_1/\ell_\infty$ regularization is to encourage shared sparsity among
the columns of the regression matrix $B$. Geometrically, like the $\ell_1$ norm that underlies the
ordinary Lasso, the $\ell_1/\ell_\infty$ block norm has a polyhedral unit ball. However, the block norm
captures potential interactions between the columns $\beta^i$ in the matrix $B$. Intuitively, taking
the maximum encourages the elements $(\beta_k^1, \beta_k^2, \ldots, \beta_k^r)$ in any given row $k = 1, \ldots, p$ to be
zero simultaneously, or to be non-zero simultaneously. Indeed, if $\beta_k^i \neq 0$ for at least one
$i \in \{1, \ldots, r\}$, then there is no additional penalty to have $\beta_k^j \neq 0$ as well, as long as
$|\beta_k^j| \leq |\beta_k^i|$.
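The "no additional penalty" property can be checked directly on a small matrix. The sketch below computes the block norm (4) for a matrix stored as a list of rows; the function name `block_norm` and the example matrices are illustrative assumptions.

```python
def block_norm(B, q):
    """Block l1/lq norm (4): the sum over rows of the lq norm of each row.

    B is a list of p rows, each a list of r coefficients; q may be float("inf").
    """
    total = 0.0
    for row in B:
        mags = [abs(v) for v in row]
        if q == float("inf"):
            total += max(mags)          # l_inf norm of the row
        else:
            total += sum(m ** q for m in mags) ** (1.0 / q)
    return total


if __name__ == "__main__":
    inf = float("inf")
    B = [[2.0, 0.0], [0.0, 0.0], [-1.0, 0.5]]
    # Fill in the zero in row 0 with an entry of smaller magnitude than the row maximum.
    B_filled = [[2.0, -1.5], [0.0, 0.0], [-1.0, 0.5]]
    print(block_norm(B, inf), block_norm(B_filled, inf))  # l1/l_inf norm is unchanged
    print(block_norm(B, 1), block_norm(B_filled, 1))      # l1/l1 norm increases
```

Under the $\ell_1/\ell_\infty$ norm the filled-in entry is free, whereas under the separable $\ell_1/\ell_1$ norm (separate Lasso problems) it is penalized in full.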

2.2 Estimation in norm and support recovery 

For a given $\lambda_n > 0$, suppose that we solve the block $\ell_1/\ell_\infty$ program, thereby obtaining an
estimate

$$\widehat{B} \in \arg\min_{B \in \mathbb{R}^{p \times r}} \Big\{ \frac{1}{2n} \sum_{i=1}^{r} \|y^i - X^i \beta^i\|_2^2 + \lambda_n \|B\|_{\ell_1/\ell_\infty} \Big\}. \qquad (6)$$

We note that under high-dimensional scaling ($p \gg n$), this convex program (6) is not
necessarily strictly convex, since the quadratic term is rank deficient and the block $\ell_1/\ell_\infty$
norm is polyhedral. However, a consequence of our analysis is that under appropriate conditions,
the optimal solution $\widehat{B}$ is in fact unique.

In this paper, we study the accuracy of the estimate $\widehat{B}$, as a function of the sample
size $n$, regression dimensions $p$ and $r$, and the sparsity index $s = \max_{i=1,\ldots,r} |S(\beta^i)|$. There
are various metrics with which to assess the "closeness" of the estimate $\widehat{B}$ to the truth $B$,
including predictive risk, various types of norm-based bounds on the difference $\widehat{B} - B$, and
variable selection consistency. In this paper, we prove results bounding the $\ell_\infty/\ell_\infty$ difference

$$\|\widehat{B} - B\|_{\ell_\infty/\ell_\infty} := \max_{k=1,\ldots,p} \; \max_{i=1,\ldots,r} |\widehat{B}_k^i - B_k^i|.$$

In addition, we prove results on support recovery criteria. Recall that for each vector
$\beta^i \in \mathbb{R}^p$, we use $S(\beta^i) = \{k \mid \beta_k^i \neq 0\}$ to denote its support set. The problem of row
support recovery corresponds to recovering the set

$$U := \bigcup_{i=1}^{r} S(\beta^i), \qquad (7)$$

corresponding to the subset $U \subseteq \{1, \ldots, p\}$ of indices that are active in at least one regression
problem. Note that the cardinality $|U|$ is upper bounded by $rs$, but can be substantially
smaller (as small as $s$) if there is overlap among the different supports.
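The two evaluation criteria just introduced are easy to state in code. The following sketch computes the column supports, the row support union (7), and the elementwise $\ell_\infty/\ell_\infty$ error for matrices stored as lists of rows. Rows are indexed from 0 here, and the helper names and the tolerance argument `tol` are assumptions made for the illustration.

```python
def column_support(B, i, tol=0.0):
    """Support S(beta^i) of column i: rows k where |B[k][i]| exceeds tol."""
    return {k for k, row in enumerate(B) if abs(row[i]) > tol}


def row_support_union(B, tol=0.0):
    """Union of supports U in (7): rows with at least one non-zero entry."""
    return {k for k, row in enumerate(B) if any(abs(v) > tol for v in row)}


def linf_linf_error(B_hat, B):
    """Elementwise error max_k max_i |B_hat[k][i] - B[k][i]|."""
    return max(
        abs(a - b)
        for row_hat, row in zip(B_hat, B)
        for a, b in zip(row_hat, row)
    )


if __name__ == "__main__":
    # r = 2 columns, each with s = 2 non-zeros, overlapping in one row (alpha = 1/2).
    B = [[1.0, 0.0], [-2.0, 0.5], [0.0, 3.0], [0.0, 0.0], [0.0, 0.0]]
    U = row_support_union(B)
    print(column_support(B, 0), column_support(B, 1), U)
    print(len(U))  # equals (2 - alpha) * s = 3, between s = 2 and r * s = 4
```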

As discussed at more length in Appendix A, given an estimate of the row support of $B$,
it is possible to either use additional structure of the solution $\widehat{B}$ or perform some additional
computation to recover individual signed supports of the columns of $B$. To be precise, define
the sign function

$$\mathrm{sign}(t) = \begin{cases} +1 & \text{if } t > 0 \\ 0 & \text{if } t = 0 \\ -1 & \text{if } t < 0. \end{cases} \qquad (8)$$

Then the recovery of individual signed supports means estimating the signed vectors with
entries $\mathrm{sign}(\beta_k^i)$, for each $i = 1, 2, \ldots, r$ and for all $k = 1, 2, \ldots, p$. Interestingly, when using
block $\ell_1/\ell_\infty$ regularization, there are multiple ways in which the support (or signed support)
can be estimated, depending on whether we use primal or dual information from an optimal
solution.

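The dual recovery method involves the following steps. First, solve the block-regularized
program (6), thereby obtaining a primal solution $\widehat{B} \in \mathbb{R}^{p \times r}$. For each row $k = 1, \ldots, p$,
compute the set $M_k := \arg\max_{i=1,\ldots,r} |\widehat{\beta}_k^i|$. Estimate the support union via
$\widehat{U} = \bigcup_{i=1,\ldots,r} S(\widehat{\beta}^i)$, and estimate the signed support vectors

$$[S_{\mathrm{dua}}(\widehat{\beta}^i)]_k = \begin{cases} \mathrm{sign}(\widehat{\beta}_k^i) & \text{if } i \in M_k \\ 0 & \text{otherwise.} \end{cases} \qquad (9)$$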



As our development will clarify, this procedure (9) corresponds to estimating the signed
support on the basis of a dual optimal solution associated with the optimal primal solution.
We discuss the primal-based recovery method and its differences with the dual-based method
at more length in Appendix A.
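The steps of the dual recovery method can be sketched as follows, given an already computed primal solution. This is a minimal illustration rather than the authors' implementation: it assumes that entries outside the maximizing set of their row are assigned sign zero, and it introduces a numerical tolerance `tol` (an assumption, since exact ties rarely survive floating-point solvers). The function name is hypothetical.

```python
def dual_signed_support(B_hat, tol=1e-9):
    """Estimate the support union and signed supports from a primal solution.

    B_hat is a list of p rows, each a list of r coefficients.
    Returns (U_hat, signs), where signs[k][i] is in {-1, 0, +1}.
    """
    p, r = len(B_hat), len(B_hat[0])
    signs = [[0] * r for _ in range(p)]
    U_hat = set()
    for k, row in enumerate(B_hat):
        row_max = max(abs(v) for v in row)
        if row_max <= tol:
            continue  # zero row: not in the estimated support union
        U_hat.add(k)
        for i, v in enumerate(row):
            # Index i belongs to the maximizing set of row k (up to tol).
            if row_max - abs(v) <= tol:
                signs[k][i] = 1 if v > 0 else -1
    return U_hat, signs


if __name__ == "__main__":
    B_hat = [[1.0, -1.0], [0.0, 0.0], [0.5, 0.2]]
    U_hat, signs = dual_signed_support(B_hat)
    print(U_hat)   # rows 0 and 2 are active
    print(signs)   # row 0 ties at the maximum; row 2 has a unique maximizer
```

Note that in row 2 the smaller non-zero entry is assigned sign zero under this rule, which is one way the dual-based estimate can differ from an estimate based on the non-zero pattern of the primal solution.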



3 Main results and their consequences 

In this section, we provide precise statements of the main results of this paper. Our first main
result (Theorem 1) provides sufficient conditions for deterministic design matrices $X^1, \ldots, X^r$,
whereas our second main result (Theorem 2) provides sufficient conditions for design matrices
drawn randomly from sub-Gaussian ensembles. Both of these results allow for an arbitrary
number $r$ of regression problems, and the random design case allows for random Gaussian
designs $X^k$ with i.i.d. rows and covariance matrix $\Sigma^k \in \mathbb{R}^{p \times p}$, $k = 1, \ldots, r$. Not surprisingly,
these results show that the high-dimensional scaling of block $\ell_1/\ell_\infty$ is qualitatively similar
to that of ordinary $\ell_1$-regularization: for instance, in the case of random Gaussian designs
and bounded $r$, our sufficient conditions ensure that $n = O(s \log p)$ samples are sufficient to
recover the union of supports correctly with high probability, which matches known results
on the Lasso [33], as well as known information-theoretic results on the problem of support
recovery [32].

As discussed in the introduction, we are also interested in the more refined question:
can we provide necessary and sufficient conditions that are sharp enough to reveal quantitative
differences between ordinary $\ell_1$-regularization and block regularization? Addressing this
question requires analysis that is sufficiently precise to control the constants in front of the
rescaled sample size $n/[s \log(p - s)]$ that controls the performance of both $\ell_1$ and block
$\ell_1/\ell_\infty$ methods. Accordingly, in order to provide precise answers to this question, our final two
results concern the special case of $r = 2$ regression problems, both with supports of size $s$ that
overlap in a fraction $\alpha$ of their entries, and with design matrices drawn randomly from the
standard Gaussian ensemble. In this setting, our final result (Theorem 3) shows that block
$\ell_1/\ell_\infty$ regularization undergoes a phase transition (that is, a rapid change from failure to
success) specified by the rescaled sample size $\theta_{1,\infty}(n, p, s, \alpha)$ previously defined in (1). We then
discuss some consequences of these results, and illustrate their sharpness with some simulation
results.




3.1 Sufficient conditions for general deterministic and random designs

In addition to the sample size $n$, problem dimensions $p$ and $r$, sparsity index $s$ and
overlap parameter $\alpha$, our results involve certain quantities associated with the design matrices
$X^i$. To begin, in the deterministic case, we assume that the columns of each design matrix
$X^i$, $i = 1, \ldots, r$ are normalized^3 so that

$$\|X_k^i\|_2^2 \leq 2n \quad \text{for all } k = 1, 2, \ldots, p. \qquad (10)$$

More significantly, we require that the following incoherence condition on the design matrix
be satisfied:

$$\gamma(X) := 1 - \max_{\ell = 1, \ldots, |U^c|} \; \sum_{i=1}^{r} \big\| \langle X_\ell^i, \; X_U^i (\langle X_U^i, X_U^i \rangle)^{-1} \rangle \big\|_1 > 0. \qquad (11)$$

For the case of the ordinary Lasso, conditions of this type are known [19, 37, 33] to be both
necessary and sufficient for successful support recovery.^4

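In addition, the statement of our results involves certain quantities associated with the
$|U| \times |U|$ matrices $\frac{1}{n} \langle X_U^i, X_U^i \rangle$; in particular, we define a lower bound on the minimum
eigenvalue

$$C_{\min}(X) \leq \min_{i=1,\ldots,r} \lambda_{\min}\Big( \frac{1}{n} \langle X_U^i, X_U^i \rangle \Big), \qquad (12)$$

as well as an upper bound on the maximum $\ell_\infty$-operator norm of the inverses

$$D_{\max}(X) \geq \max_{i=1,\ldots,r} \Big|\Big|\Big| \Big( \frac{1}{n} \langle X_U^i, X_U^i \rangle \Big)^{-1} \Big|\Big|\Big|_\infty. \qquad (13)$$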

Remembering that our analysis applies to sequences $\{X_{n,p}\}$ of design matrices, in the
simplest scenario, both of the bounding quantities $C_{\min}$ and $D_{\max}$ do not scale with $(n, p, s)$.
To keep notation compact, we write $C_{\min}$ and $D_{\max}$ in the analysis to follow.
We also define the support minimum value

$$B^*_{\min} := \min_{k \in U} \; \max_{i=1,\ldots,r} |\beta_k^i|, \qquad (14)$$

corresponding to the minimum value of the $\ell_\infty$ norm of any row $k \in U$.
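For a concrete set of design matrices, the quantities in (11) through (14) can be evaluated numerically. The sketch below uses NumPy and takes the tightest admissible values of $C_{\min}$ and $D_{\max}$ (that is, equality in (12) and (13)). It follows the reading of (11) in which the $\ell_1$ norms are summed over the $r$ problems before maximizing over indices outside $U$, and it assumes $U^c$ is non-empty and each $X_U^i$ has full column rank. All function names are hypothetical.

```python
import numpy as np


def design_quantities(X_list, U):
    """Return (gamma, C_min, D_max) for design matrices X^1..X^r and row support U."""
    n, p = X_list[0].shape
    U = sorted(U)
    in_U = set(U)
    Uc = [j for j in range(p) if j not in in_U]
    incoherence = np.zeros(len(Uc))
    c_min, d_max = np.inf, 0.0
    for X in X_list:
        XU = X[:, U]
        gram = XU.T @ XU                      # <X_U, X_U>
        # Row l holds X_l^T X_U (X_U^T X_U)^{-1} for each l in U^c.
        W = X[:, Uc].T @ XU @ np.linalg.inv(gram)
        incoherence += np.abs(W).sum(axis=1)  # l1 norm per row, summed over problems
        c_min = min(c_min, np.linalg.eigvalsh(gram / n).min())
        # l_inf operator norm = maximum absolute row sum.
        d_max = max(d_max, np.abs(np.linalg.inv(gram / n)).sum(axis=1).max())
    gamma = 1.0 - incoherence.max()
    return gamma, c_min, d_max


def support_minimum(B, U):
    """Support minimum value (14): smallest row-wise maximum magnitude over U."""
    return min(np.abs(B[k]).max() for k in U)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p, r = 200, 20, 2
    X_list = [rng.standard_normal((n, p)) for _ in range(r)]
    print(design_quantities(X_list, U=[0, 1, 2]))
```

A positive value of `gamma` indicates that the incoherence condition holds for the given designs and support; for orthogonal columns the incoherence term vanishes and `gamma` equals 1.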

Theorem 1 (Sufficient conditions for deterministic designs). Consider the observation model (3)
with design matrices $X^i$ satisfying the column bound (10) and incoherence condition (11).
Suppose that we solve the block-regularized $\ell_1/\ell_\infty$ convex program (6) with regularization
parameter $\lambda_n^2 \geq \frac{4 \xi \sigma^2}{\gamma^2} \, \frac{r^2 + r \log(p)}{n}$ for some $\xi > 1$. Then with probability greater than



$$\phi_1(\xi, p, s) := 1 - 2\exp(-(\xi - 1)[r + \log p]) - 2\exp(-(\xi^2 - 1)\log(rs)), \qquad (15)$$

we are guaranteed that

(a) The block-regularized program has a unique solution $\widehat{B}$ such that $\bigcup_{i=1}^{r} S(\widehat{\beta}^i) \subseteq U$.



(b) Moreover, the solution satisfies the elementwise $\ell_\infty$-bound

$$\|\widehat{B} - B\|_{\ell_\infty/\ell_\infty} \leq \underbrace{\xi \sqrt{\frac{4 \sigma^2 \log(rs)}{C_{\min}\, n}} + D_{\max} \lambda_n}_{b_1(\xi, \lambda_n, n, s)}. \qquad (16)$$

Consequently, as long as $B^*_{\min} > b_1(\xi, \lambda_n, n, s)$, then $\bigcup_{i=1}^{r} S(\widehat{\beta}^i) = U$, so that the
solution $\widehat{B}$ correctly specifies the union of supports $U$.

^3 The choice of the factor 2 in this bound is for later technical convenience.

^4 Some work [20] has shown that multi-stage methods can allow some relaxation of this incoherence
condition; however, as our main interest is in understanding the sample complexity of ordinary
$\ell_1$ versus $\ell_1/\ell_\infty$ relaxations, we do not pursue such extensions here.

We now state an analogous result for random design matrices; in particular, consider
the observation model (3) with design matrices $X^i$ chosen with i.i.d. rows from covariance
matrices $\Sigma^i$. In analogy to definitions (12) and (13) in the deterministic case, we define the
lower bound

$$C_{\min}(\Sigma) \leq \min_{i=1,\ldots,r} \lambda_{\min}(\Sigma^i_{UU}), \qquad (17)$$

as well as an analogous upper bound on the $\ell_\infty$-operator norm of the inverses

$$D_{\max}(\Sigma) \geq \max_{i=1,\ldots,r} \big|\big|\big| (\Sigma^i_{UU})^{-1} \big|\big|\big|_\infty. \qquad (18)$$

Note that unlike the case of deterministic designs, these quantities are not functions of the
design matrix $X$, which is now a random variable. Finally, our results involve an analogous
incoherence parameter of the covariance matrices $\Sigma = \{\Sigma^i, \; i = 1, \ldots, r\}$, defined as

$$\gamma(\Sigma) := 1 - \max_{\ell = 1, \ldots, |U^c|} \; \sum_{i=1}^{r} \big\| \Sigma^i_{\ell U} (\Sigma^i_{UU})^{-1} \big\|_1 > 0. \qquad (19)$$

With this notation, the following result provides an analog of Theorem 1 for random design
matrices:

Theorem 2 (Sufficient conditions for random Gaussian designs). Suppose that we are given
$n$ i.i.d. observations from the model (3) with

$$n > \frac{8 \kappa}{C_{\min} \gamma^2} \, s \, (r + \log p) \qquad (20)$$

for some $\kappa > 1$. If we solve the convex program (6) with regularization parameter satisfying
$\lambda_n^2 \geq \frac{4 \xi \sigma^2}{\gamma^2} \, \frac{r^2 + r \log(p)}{n}$ for some $\xi > 1$, then with probability greater than

$$\phi_2(\kappa, \xi, n, p, s) := 1 - 2\exp\{-2(\xi^2 - 1)\log(rs)\} - 2\exp\{-\kappa(r + \log p)\} \to 1, \qquad (21)$$

we are guaranteed that

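(a) The block-regularized program (6) has a unique solution $\widehat{B}$ such that $\bigcup_{i=1}^{r} S(\widehat{\beta}^i) \subseteq U$.

(b) The solution satisfies the elementwise $\ell_\infty$ bound

$$\|\widehat{B} - B\|_{\ell_\infty/\ell_\infty} \leq \underbrace{\xi \sqrt{\frac{100 \sigma^2 \log(rs)}{C_{\min}\, n}} + \lambda_n \Big[ \frac{4s}{C_{\min} \sqrt{n}} + D_{\max} \Big]}_{b_2(\xi, \lambda_n, n, s)}. \qquad (22)$$

Consequently, if $B^*_{\min} > b_2(\xi, \lambda_n, n, s)$, then $\bigcup_{i=1}^{r} S(\widehat{\beta}^i) = U$, so that the solution $\widehat{B}$
correctly specifies the union of supports $U$.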
To clarify the interpretation of Theorems 1 and 2, part (a) of each claim guarantees
that the estimator has no false inclusions, in that the row support of the estimate $\widehat{B}$ is
contained within the row support of the true matrix $B$. One consequence of part (b) is that
as long as the minimum signal parameter $B^*_{\min}$ decays slowly enough, then the estimators
have no false exclusions, so that the true row support is correctly recovered.

In terms of consistency rates in block ^oo/^oo norm, assuming that the design-related 
quantities C m i n , D max and 7 do not scale with p, Theorem [2(a) guarantees consistency in 
elementwise loo-norm at the rate 
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Here we have used the fact that log|£7| < log(rs) = o(r log p). Similarly, Theorem [2Kb) 
guarantees consistency in elementwise ^-norm at the rate 



\B-B\U /t = 0( ff 2 Jmax{l,-^} r^ + rlogp , 
1 IIW*«> V y rlogp J n } 

In this expression, the extra term $\max\{1, s/(r \log p)\}$ arises in the analysis due to the need
to control the norms of the random design matrices. For sufficiently sparse problems (e.g.,
$s = O(\log p)$), this factor is constant.



At a high level, our results thus far show that for a fixed number $r$ of regression problems,
the $\ell_1/\ell_\infty$ method guarantees exact support recovery with $n = \Omega(s \log p)$ samples, and
guarantees consistency in an elementwise sense at rate $O(\sqrt{\log p / n})$. In qualitative terms, these
results match the known scaling [33] for the Lasso ($\ell_1$-regularized QP), which is obtained as
the special case of univariate regression ($r = 1$). It should be noted that this scaling is known
to be optimal in an information-theoretic sense: no algorithm can recover the support correctly
if the rescaled sample size $\theta_{\mathrm{Las}} = \frac{n}{2 s \log(p - s)}$ is below a critical threshold [321 El].

3.2 A phase transition for standard Gaussian ensembles 

In order to provide keener insight into the advantages and/or disadvantages associated with
using $\ell_1/\ell_\infty$ block regularization, we need to obtain even sharper results, ones that are capable
of distinguishing constants in front of the rescaled sample size $\theta_{\mathrm{Las}}$. With this aim in mind,
the following results are specialized to the case of $r = 2$ regression problems, where the
corresponding design matrices $X^i$, $i = 1, 2$ are sampled from the standard Gaussian ensemble, i.e.,
with i.i.d. rows $N(0, I_{p \times p})$. By studying this simpler class of problems, we can make
quantitative comparisons to the sample complexity of the Lasso, which provide insight into the
benefits and dangers of block $\ell_1/\ell_\infty$ regularization.

The main result of this section asserts that there is a phase transition in the performance
of $\ell_1/\ell_\infty$ quadratic programming for support recovery, by which we mean a sharp transition
from failure to success, and provides the exact location of this transition point as a function
of $(n, p, s)$ and the overlap parameter $\alpha \in (0, 1)$. The phase transition involves the support
gap

\[ B_{\mathrm{gap}} := \max_{i \in S(\beta^{*1}) \cap S(\beta^{*2})} \big|\, |\beta^{*1}_i| - |\beta^{*2}_i| \,\big|. \qquad (23) \]

This quantity measures how close the two regression vectors are in absolute value on their
shared support. Our main theorem treats the case in which this gap vanishes (i.e., $B_{\mathrm{gap}} = o(1)$);
note that block $\ell_1/\ell_\infty$ regularization is best-suited to this type of structure. A subsequent
corollary provides more general but technical conditions for the cases of non-vanishing support
gaps. Our main result specifies a phase transition in terms of the rescaled sample size

\[ \theta_{1,\infty}(n, p, s, \alpha) := \frac{n}{(4 - 3\alpha)\, s \log(p - (2 - \alpha)s)}, \qquad (24) \]

as stated in the theorem below.

Theorem 3 (Phase transition). Consider sequences of problems, indexed by $(n, p, s, \alpha)$, drawn
from the observation model (3) with random design $X$ drawn with i.i.d. standard Gaussian
entries and with $C_{\min} = 1 = D_{\max}$.

(a) Success: Suppose that the problem sequence $(n, p, s, \alpha)$ satisfies

\[ \theta_{1,\infty}(n, p, s, \alpha) > 1 + \delta \quad \text{for some } \delta > 0. \qquad (25) \]



If we solve the block-regularized program (6) with $\lambda_n \ge \sqrt{\xi \sigma^2 \log(p) / n}$ for some $\xi > 2$ and
$B_{\mathrm{gap}} = o(\lambda_n)$, then with probability greater than $1 - c_1 \exp(-c_2 \log(p - (2 - \alpha)s))$, the
block $\ell_1/\ell_\infty$ program (6) has a unique solution $\hat{B}$ such that $S(\hat{B}) \subseteq U$, and moreover it
satisfies the elementwise bound (22) with $C_{\min} = 1 = D_{\max}$. In addition, if $B^*_{\min} > b_2(\xi, \lambda_n, n, s)$,
then the unique solution recovers the correct signed support.

(b) Failure: For problem sequences $(n, p, s, \alpha)$ such that

\[ \theta_{1,\infty}(n, p, s, \alpha) < 1 - \delta \quad \text{for some } \delta > 0, \qquad (26) \]

and for any non-increasing regularization sequence $\lambda_n > 0$, no solution $\hat{B} = (\hat{\beta}^1, \hat{\beta}^2)$ to
the block-regularized program (6) has the correct signed support.

In a nutshell, Theorem 3 states that block $\ell_1/\ell_\infty$ regularization recovers the correct support
with high probability for sequences $(n, p, s, \alpha)$ such that $\theta_{1,\infty}(n, p, s, \alpha) > 1$, and otherwise
fails with high probability.
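
To make the order parameter concrete, the following minimal sketch evaluates the rescaled sample size from equation (24) and reports which side of the predicted threshold a given problem falls on. The function names and the tolerance `delta` are illustrative choices, not part of the paper, and the theorem itself is asymptotic, so the output is only a heuristic for finite problems.

```python
import math


def theta_block(n, p, s, alpha):
    """Rescaled sample size n / [(4 - 3*alpha) * s * log(p - (2 - alpha) * s)]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    outside = p - (2.0 - alpha) * s  # number of indices outside the union of supports
    if outside <= 1.0:
        raise ValueError("need p - (2 - alpha) * s > 1 so that the logarithm is positive")
    return n / ((4.0 - 3.0 * alpha) * s * math.log(outside))


def threshold_sample_size(p, s, alpha):
    """Sample size at which theta_block equals exactly 1."""
    return (4.0 - 3.0 * alpha) * s * math.log(p - (2.0 - alpha) * s)


def predicted_regime(n, p, s, alpha, delta=0.1):
    """Classify a problem as 'success', 'failure', or 'near threshold'.

    delta is a hypothetical margin standing in for the delta of the theorem.
    """
    theta = theta_block(n, p, s, alpha)
    if theta > 1.0 + delta:
        return "success"
    if theta < 1.0 - delta:
        return "failure"
    return "near threshold"
```

For instance, `threshold_sample_size(512, 52, 1.0)` is smaller than `threshold_sample_size(512, 52, 0.1)`, reflecting that larger overlap lowers the predicted sample requirement.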

We now consider the case in which the support gap does not vanish, and show that it only
further degrades the performance of block $\ell_1/\ell_\infty$ regularization. To make the degree of this
degradation precise, we define the $\lambda_n$-truncated gap vector $T_{\lambda_n}(B^*) \in \mathbb{R}^p$, with elements

\[ [T_{\lambda_n}(B^*)]_i := \begin{cases} \min\big\{ \lambda_n,\; \big|\, |\beta^{*1}_i| - |\beta^{*2}_i| \,\big| \big\} & \text{if } i \in S(\beta^{*1}) \cap S(\beta^{*2}), \\ 0 & \text{otherwise.} \end{cases} \]

Recall that the support overlap $S(\beta^{*1}) \cap S(\beta^{*2})$ has cardinality $\alpha s$ by assumption. Therefore,
$T_{\lambda_n}(B^*)$ has at most $\alpha s$ non-zero entries, and moreover $\|T_{\lambda_n}(B^*)\|_2^2 \le \lambda_n^2 \alpha s$. We then define
the rescaled gap limit

\[ \Delta(B^*, \lambda_n) := \limsup_{(n,p,s)} \frac{\|T_{\lambda_n}(B^*)\|_2^2}{\lambda_n^2\, s}. \qquad (27) \]

Note that $\Delta(B^*, \lambda_n) \in [0, \alpha]$ by construction. With these definitions, we have the following:

Corollary 1 (Poorer performance with non-vanishing gap). If for any $\delta > 0$, the sample size
$n$ is upper bounded as

\[ n < (1 - \delta) \big[ (4 - 3\alpha) + \Delta(B^*, \lambda_n) \big]\, s \log[p - (2 - \alpha)s], \qquad (28) \]

then the dual recovery method (6) fails to recover the individual signed supports.

To understand the implications of this result, suppose that all $\alpha s$ of the gaps $\big|\,|\beta^{*1}_i| - |\beta^{*2}_i|\,\big|$
were above the regularization level $\lambda_n$. Then by definition, we have $\Delta(B^*, \lambda_n) = \alpha$, so that
condition (28) implies that the method fails for all $n < (1 - \delta)[4 - 2\alpha]\, s \log[p - (2 - \alpha)s]$.
Since the factor $(4 - 2\alpha)$ is strictly greater than 2 for all $\alpha < 1$, this scaling is always worse (footnote 5)
than the Lasso scaling given by $n \asymp 2 s \log(p - s)$ (see equation (2)), unless there is perfect
overlap ($\alpha = 1$), in which case it yields no improvements. Consequently, Corollary 1 shows
that the performance of $\ell_1/\ell_\infty$ regularization is also very sensitive to the numerical amplitudes
of the signal vectors.
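
As an illustration of the quantities entering Corollary 1, the sketch below computes a truncated gap vector and its finite-sample ratio $\|T\|_2^2 / (\lambda_n^2 s)$ for two given coefficient vectors. It assumes the truncation takes the form min(lambda, gap) on the shared support and zero elsewhere; the function names are hypothetical, and the ratio is only a finite-problem stand-in for the lim sup in (27).

```python
import math


def truncated_gap(beta1, beta2, lam):
    """Entry i is min(lam, ||beta1[i]| - |beta2[i]||) on the shared support, else 0."""
    out = []
    for b1, b2 in zip(beta1, beta2):
        if b1 != 0 and b2 != 0:
            out.append(min(lam, abs(abs(b1) - abs(b2))))
        else:
            out.append(0.0)
    return out


def rescaled_gap(beta1, beta2, lam, s):
    """Finite-sample ratio ||T||_2^2 / (lam^2 * s); it lies in [0, alpha]."""
    t = truncated_gap(beta1, beta2, lam)
    return sum(v * v for v in t) / (lam * lam * s)


def failure_sample_bound(p, s, alpha, gap_ratio, delta=0.0):
    """Right-hand side of the sample-size condition in Corollary 1."""
    return (1.0 - delta) * ((4.0 - 3.0 * alpha) + gap_ratio) * s * math.log(p - (2.0 - alpha) * s)


# Two vectors with s = 4 non-zeros each, sharing indices 0 and 1 (alpha = 0.5).
beta1 = [1.0, 0.5, 2.0, -1.0, 0.0, 0.0, 0.0, 0.0]
beta2 = [1.0, -3.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0]
example_gap = truncated_gap(beta1, beta2, lam=0.5)
example_ratio = rescaled_gap(beta1, beta2, lam=0.5, s=4)
```

In this toy example only index 1 has a gap, and it is truncated at the regularization level, so the ratio stays below the overlap fraction.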



3.3 Illustrative simulations and some consequences 

In this section, we provide some simulation results to illustrate the phase transition predicted
by Theorem 3. Interestingly, these results show that the theory provides an accurate
description of practice even for relatively small problem sizes (e.g., $p = 128$). As specified in
Theorem 3, we simulate multivariate regression problems with $r = 2$ columns, with the design
matrices $X^i$ drawn from the standard Gaussian ensemble. In all cases, we initially solved the
$\ell_1/\ell_\infty$ program using MATLAB, and then verified that the behavior of the solution agreed
with the primal-dual optimality conditions specified by our theory. In subsequent simulations,
we solved directly for the dual variables, and then checked whether or not the dual feasibility
conditions are met.
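
The simulation set-up described above can be sketched as follows. This is a minimal data generator only (it does not solve the convex program), and several details are assumptions not stated in the text: the shared and private support indices are drawn uniformly at random, all non-zero coefficients have the same magnitude `beta_min` with random signs (so the support gap is zero), and the overlap count is `round(alpha * s)`.

```python
import random


def make_supports(p, s, alpha, rng):
    """Return two index sets of size s in {0, ..., p-1} sharing round(alpha * s) indices."""
    overlap = round(alpha * s)
    union_size = 2 * s - overlap
    if union_size > p:
        raise ValueError("union of supports does not fit in p dimensions")
    idx = rng.sample(range(p), union_size)
    shared = idx[:overlap]
    only_first = idx[overlap:s]   # s - overlap indices private to problem 1
    only_second = idx[s:]         # s - overlap indices private to problem 2
    return set(shared) | set(only_first), set(shared) | set(only_second)


def simulate_problem(n, p, s, alpha, sigma=1.0, beta_min=1.0, seed=0):
    """Draw r = 2 regression problems y = X beta + w with standard Gaussian designs."""
    rng = random.Random(seed)
    supports = make_supports(p, s, alpha, rng)
    problems = []
    for support in supports:
        beta = [0.0] * p
        for j in support:
            beta[j] = beta_min * rng.choice((-1.0, 1.0))
        X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
        y = [sum(x * b for x, b in zip(row, beta)) + rng.gauss(0.0, sigma) for row in X]
        problems.append((X, y, beta))
    return problems, supports
```

The returned supports can then be compared against the row support of any estimate to record success or failure of joint support recovery.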

We first illustrate the difference between unscaled and rescaled plots of the empirical
performance, which demonstrate that the rescaled sample size $n/[s \log(p - s)]$ specifies the
high-dimensional scaling of block $\ell_1/\ell_\infty$ regularization. Figure 2(a) shows the empirical
behavior of the block $\ell_1/\ell_\infty$ method for joint support recovery. For these simulations, we applied
the method to $r = 2$ regression problems with overlap $\alpha = 1$, and to three different problem
sizes $p \in \{128, 256, 512\}$, in all cases with the sparsity index $s = \lceil 0.1p \rceil$. Each curve in panel
(a) shows the probability of correct support recovery $\mathbb{P}[\hat{U} = U]$ versus the raw sample size
$n$. As would be expected, all the curves initially start at $\mathbb{P}[\hat{U} = U] = 0$, but then transition
to 1 as $n$ increases, with the transition taking place at larger and larger sample sizes as $p$
is increased. The purpose of the rescaling is to determine exactly how this transition point
depends on the problem size $p$ and other structural parameters ($s$ and $\alpha$). Figure 2(b) shows
the same simulation results, now plotted versus the rescaled sample size $\theta := n/[2s \log(p - s)]$,
which is the appropriate rescaling predicted by our theory. Notice how all three curves now
lie on top of one another, and moreover transition from failure to success at $\theta \approx 1$, consistent with
our theoretical predictions.

We now seek to explore the dependence of the sample size on the overlap fraction $\alpha \in [0, 1]$
of the two regression vectors. For this purpose, we plot the probability of successful recovery
versus the rescaled sample size

\[ \theta_{1,\infty}(n, p, s, \alpha) = \frac{n}{(4 - 3\alpha)\, s \log(p - (2 - \alpha)s)}. \]

Footnote 5. Here we are assuming that $s/p = o(1)$, so that $\log(p - s) \asymp \log[p - (2 - \alpha)s]$.

Figure 2. (a) Plots of the probability $\mathbb{P}[\hat{U} = U]$ of successful joint support recovery versus
the sample size $n$. Each curve corresponds to a different problem size $p$; notice how the curves
shift to the right as $p$ increases, reflecting the difficulty of solving larger problems. (b) Plots of
the same data versus the rescaled sample size $n/[2s \log(p - s)]$; note how all three curves now
align with one another, showing that this order parameter is the correct scaling for assessing
the method.

As shown by Figure 2(b), when plotted with this rescaling, there is no longer any dependence
on the problem size $p$. Moreover, if we choose the sparsity index $s$ to grow in a fixed way with
$p$ (i.e., $s = f(p)$ for some fixed function $f$), then the only remaining free variable is the overlap
parameter $\alpha$. Note that the theory predicts that the required sample size should decrease as
$\alpha$ increases towards 1.

As shown earlier, Figure 1 plots the probability of successful recovery of the
joint supports versus the rescaled sample size $\theta_{1,\infty}(n, p, s, \alpha)$. Notice that the plot shows
four sets of "stacked" curves, where each stack corresponds to a different choice of the overlap
parameter, ranging from $\alpha = 1$ (left-most stack) to $\alpha = 0.1$ (right-most stack). Each stack
contains three curves, corresponding to the problem sizes $p \in \{128, 256, 512\}$. In all cases,
we fixed the support size $s = 0.1p$. As with Figure 2(b), the "stacking" behavior of these
curves demonstrates that Theorem 3 isolates the correct dependence on $p$. Moreover, their
step-like behavior is consistent with the theoretical prediction of a phase transition. Notice
how the curves shift towards the left as the overlap parameter $\alpha$ increases towards
one, reflecting that the problems become easier as the amount of shared sparsity increases.
To assess this shift in a qualitative manner for each choice of overlap $\alpha \in \{0.1, 0.3, 0.7, 1\}$,
we plot a vertical line within each group, which is obtained as the threshold value of $\theta_{1,\infty}$
predicted by our theory. Observe how the theoretical value shows excellent agreement with
the empirical behavior.

As noted previously, Theorem 3 has some interesting consequences, particularly
in comparison to the behavior of the "naive" Lasso-based individual decoding of signed
supports, that is, the method that simply applies the Lasso (ordinary $\ell_1$-regularization) to
each column $i = 1, 2$ separately. By known results [33] on the Lasso, the performance of
this naive approach is governed by the order parameter $\theta_{\mathrm{Las}}(n, p, s) = \frac{n}{2 s \log(p - s)}$, meaning
that for any $\delta > 0$, it succeeds for sequences such that $\theta_{\mathrm{Las}} > 1 + \delta$, and conversely fails
for sequences such that $\theta_{\mathrm{Las}} < 1 - \delta$. To compare the two methods, we define the relative
efficiency coefficient $R(\theta_{1,\infty}, \theta_{\mathrm{Las}}) := \theta_{\mathrm{Las}}(n, p, s) / \theta_{1,\infty}(n, p, s, \alpha)$. A value of $R < 1$ implies
that the block method is more efficient, while $R > 1$ implies that the naive method is more
efficient. With this notation, we have the following:

Corollary 2. The relative efficiency of the block $\ell_1/\ell_\infty$ program (6) compared to the Lasso is
given by $R(\theta_{1,\infty}, \theta_{\mathrm{Las}}) = \frac{4 - 3\alpha}{2} \cdot \frac{\log(p - (2 - \alpha)s)}{\log(p - s)}$. Thus, for sublinear sparsity $s/p \to 0$, the block
scheme has greater statistical efficiency for all overlaps $\alpha \in (2/3, 1]$, but lower statistical
efficiency for overlaps $\alpha \in [0, 2/3)$.
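
A short numerical check of Corollary 2 may help fix ideas. The sketch below evaluates the relative efficiency as the ratio of the two order parameters, together with its sublinear-sparsity limit $(4 - 3\alpha)/2$; the function names are illustrative and not taken from the paper.

```python
import math


def relative_efficiency(p, s, alpha):
    """R = theta_Las / theta_block = (4 - 3*alpha)/2 * log(p - (2 - alpha)*s) / log(p - s)."""
    return (4.0 - 3.0 * alpha) / 2.0 * math.log(p - (2.0 - alpha) * s) / math.log(p - s)


def limiting_efficiency(alpha):
    """Limit of R as s/p -> 0, where the ratio of logarithms tends to 1."""
    return (4.0 - 3.0 * alpha) / 2.0


def block_is_more_efficient(alpha):
    """True when the limiting relative efficiency is below 1, i.e. alpha > 2/3."""
    return limiting_efficiency(alpha) < 1.0
```

With full overlap the block method needs half as many samples in this limit, while with disjoint supports it needs twice as many; the crossover sits at an overlap of 2/3.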



[Figure 3 plot: "Scaling factor versus $\alpha$"; horizontal axis: overlap parameter $\alpha$.]

Figure 3. Plots of the relative statistical efficiency $R(\alpha)$ of a method based on block $\ell_1/\ell_\infty$
regularization versus the Lasso (ordinary $\ell_1$-regularization). For each value of the parameter
$\alpha \in [0, 1]$ that measures overlap between the regression problems, the quantity $R(\alpha)$ is the
ratio of sample size required by an estimator based on block $\ell_1/\ell_\infty$-regularization relative to
the sample size required by the Lasso (ordinary $\ell_1$-regularization). The error criterion here is
recovery of the correct subset of active variables in the regression. Over a range of overlaps, the
empirical thresholds of the $\ell_1/\ell_\infty$ block regularization method closely align with the theoretical
prediction of $(4 - 3\alpha)/2$. The block-based method begins to give benefits versus the "naive"
Lasso-based method at the critical overlap $\alpha^* \approx 2/3$, at which point the relative efficiency $R(\alpha)$
first drops below 1. For overlaps $\alpha \in [0, 2/3)$, the joint method actually requires more samples
than the naive method.

Figure 3 provides an alternative perspective on the data, where we have plotted how
the sample size required by block regression changes as a function of the overlap parameter
$\alpha \in [0, 1]$. Each set of data points plots a scaled form of the sample size required to hit 50%
success, for a range of overlaps, together with the straight line $(4 - 3\alpha)/2$ predicted by Theorem 3.
Note the excellent agreement between the experimental results, for all three problem sizes
$p \in \{128, 256, 512\}$, and the full range of overlaps. The line $(4 - 3\alpha)/2$ also characterizes the
relative efficiency $R$ of block regularization versus the naive Lasso-based method, as described
in Corollary 2. For overlaps $\alpha > 2/3$, this parameter $R$ drops below 1. On the other hand,
for overlaps $\alpha < 2/3$, we have $R > 1$, so that applying the joint optimization problem actually
decreases statistical efficiency. Intuitively, although there is still some fraction of overlap, the
regularization is misleading, in that it tries to enforce a higher degree of shared sparsity than
is actually present in the data.

4 Proofs



This section contains the proofs of our three theorems. Our proofs are constructive in nature,
based on a procedure that constructs a pair of matrices $\hat{B} = (\hat{\beta}^1, \ldots, \hat{\beta}^r) \in \mathbb{R}^{p \times r}$ and
$\hat{Z} = (\hat{z}^1, \ldots, \hat{z}^r) \in \mathbb{R}^{p \times r}$. The goal of the construction is to show that the matrix $\hat{B}$ is an
optimal primal solution to the convex program (6), and that the matrix $\hat{Z}$ is a corresponding
dual-optimal solution, meaning that it belongs to the sub-differential of the $\ell_1/\ell_\infty$-norm (see
Lemma 1), evaluated at $\hat{B}$. If the construction succeeds, then the pair $(\hat{B}, \hat{Z})$ acts as a witness
for the success of the convex program (6) in recovering the correct signed support; in
particular, success of the primal-dual witness procedure implies that $\hat{B}$ is the unique optimal
solution of the convex program (6), with its row support contained within $U$. To be clear, the
procedure for constructing this candidate primal-dual solution is not a practical algorithm
(as it exploits knowledge of the true support sets), but rather a proof technique for certifying
the correctness of the block-regularized program.

We begin by providing some background on the sub-differential of the $\ell_1/\ell_\infty$ norm; we
refer the reader to the books [26], [8] for more background on convex analysis.



4.1 Structure of $\ell_1/\ell_\infty$-norm sub-differential

The sub-differential of a convex function $f : \mathbb{R}^d \to \mathbb{R}$ at a point $x \in \mathbb{R}^d$ is the set of all vectors
$y \in \mathbb{R}^d$ such that $f(x') \ge f(x) + \langle y, x' - x \rangle$ for all $x' \in \mathbb{R}^d$. See the standard references [26], [8]
for background on subdifferentials and their properties.

We state for future reference a characterization of the sub-differential of the $\ell_1/\ell_\infty$ block
norm:

Lemma 1. The matrix $Z \in \mathbb{R}^{p \times r}$ belongs to the sub-differential $\partial \|B\|_{\ell_1/\ell_\infty}$ if and only if the
following conditions hold for each $k = 1, \ldots, p$.

(i) If $\beta^i_k \ne 0$ for at least one index $i \in \{1, \ldots, r\}$, then

\[ z^i_k = \begin{cases} t_i\, \mathrm{sign}(\beta^i_k) & \text{if } i \in M_k, \\ 0 & \text{otherwise,} \end{cases} \]

where $M_k := \arg\max_{i = 1, \ldots, r} |\beta^i_k|$, for a set of non-negative scalars $\{t_i, i \in M_k\}$ such that
$\sum_{i \in M_k} t_i = 1$.

(ii) If $\beta^i_k = 0$ for all $i = 1, \ldots, r$, then we require $\sum_{i=1}^r |z^i_k| \le 1$.
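
The two conditions of Lemma 1 translate directly into a membership test. The following sketch checks them row by row for matrices stored as lists of rows; the function name and the numerical tolerance `tol` are illustrative assumptions, and exact arithmetic would replace the tolerance in a formal argument.

```python
def in_block_subdifferential(B, Z, tol=1e-9):
    """Check whether Z is a subgradient of the l1/l_inf block norm at B.

    B and Z are lists of p rows, each row holding r entries.
    """
    for b_row, z_row in zip(B, Z):
        row_max = max(abs(b) for b in b_row)
        if row_max > tol:
            # Case (i): the row of B is non-zero.
            weight_sum = 0.0
            for b, z in zip(b_row, z_row):
                if abs(abs(b) - row_max) <= tol:
                    # Index attains the row maximum: z = t * sign(b) with t >= 0.
                    t = z if b > 0 else -z
                    if t < -tol:
                        return False
                    weight_sum += t
                elif abs(z) > tol:
                    # Indices outside the arg-max set must carry zero weight.
                    return False
            if abs(weight_sum - 1.0) > tol:
                return False
        else:
            # Case (ii): the row of B is zero, so the row of Z lies in the l1 unit ball.
            if sum(abs(z) for z in z_row) > 1.0 + tol:
                return False
    return True
```

Note that a row with a tie in absolute value (such as the first row in the checks below) admits a whole family of valid subgradients, which is the source of the non-trivial dual structure exploited in the proofs.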



4.2 Primal-dual construction 

We now describe our method for constructing the matrix pair $(\hat{B}, \hat{Z})$. Recalling that $U = \bigcup_{i=1}^r S(\beta^{*i})$
denotes the union of supports of the true regression vectors, let $U^c$ denote the complement
$\{1, \ldots, p\} \setminus U$. With this notation, Figure 4 provides the four steps of the primal-dual witness
construction.

Primal-dual witness construction:

(A) First, we solve the restricted program

\[ \hat{B}_U = \arg\min_{B_U \in \mathbb{R}^{|U| \times r}} \Big\{ \frac{1}{2n} \sum_{i=1}^r \|y^i - X^i_U \beta^i_U\|_2^2 + \lambda_n \|B_U\|_{\ell_1/\ell_\infty} \Big\}. \qquad (29) \]

Given our assumption that the $|U| \times |U|$ sub-matrices $\langle X^i_U, X^i_U \rangle$ are invertible,
the solution to this convex program is unique. Moreover, note that $\hat{B}_{U^c} = 0$
by construction.

(B) We choose $\hat{Z}_U \in \mathbb{R}^{|U| \times r}$ as an element of the subdifferential $\partial \|\hat{B}_U\|_{\ell_1/\ell_\infty}$.

(C) Using the optimality conditions associated with the original convex program (6),
we then solve for the matrix $\hat{Z}_{U^c}$, and verify that its rows satisfy the
strict dual feasibility condition

\[ \sum_{i=1}^r |\hat{z}^i_k| < 1 \quad \text{for all } k \in U^c. \qquad (30) \]

(D) A final (optional) step is to verify that $\hat{B}_U$ satisfies the sign consistency
conditions $\mathrm{sign}(\hat{B}_U) = \mathrm{sign}(B^*_U)$.

Figure 4. Steps in the primal-dual witness construction. Steps (A) and (B) are
straightforward; the main difficulties lie in verifying the strict dual feasibility and sign consistency
conditions stated in steps (C) and (D).

The following lemma summarizes the utility of the primal-dual witness method:

Lemma 2. Suppose that for each $i = 1, \ldots, r$, the $|U| \times |U|$ sub-matrix $\langle X^i_U, X^i_U \rangle$ is invertible.
Then for any $\lambda_n > 0$, we have the following correspondences:

(i) If steps (A) through (C) of the primal-dual construction succeed, then $(\hat{B}_U, 0) \in \mathbb{R}^{p \times r}$ is
the unique optimal solution of the original convex program (6).

(ii) Conversely, suppose that there is a solution $\hat{B} \in \mathbb{R}^{p \times r}$ to the convex program (6) with
support contained within $U$. Then steps (A) through (C) of the primal-dual witness
construction succeed.

We provide the proof of Lemma 2 in Appendix D.2. It is convex-analytic in nature, based on
exploiting the subgradient optimality conditions associated with both the restricted convex
program (29) and the original program (6), and performing some algebra to characterize
when the convex program recovers the correct signed support. Lemma 2 lies at the heart of
all three of our theorems. In particular, the positive results of Theorem 1, Theorem 2 and
Theorem 3(a) are based on claim (i), which shows that it is sufficient to verify that
the primal-dual witness construction succeeds with high probability. The negative result of
Theorem 3(b), in contrast, is based on part (ii), which can be restated as asserting that if the
primal-dual witness construction fails, then no solution has support contained within $U$.

Before proceeding to the proofs themselves, we introduce some additional notation and
develop some auxiliary results concerning the primal-dual witness procedure, to be used in
subsequent development. With reference to steps (A) and (B), we show in Appendix D.2 that
the unique solution $\hat{B}_U$ has the form

\[ \hat{B}_U = B^*_U + \Delta_U, \qquad (31) \]

where the matrix $\Delta_U \in \mathbb{R}^{|U| \times r}$ has columns

\[ \Delta^i_U := \Big( \frac{1}{n} \langle X^i_U, X^i_U \rangle \Big)^{-1} \Big[ \frac{1}{n} \langle X^i_U, w^i \rangle - \lambda_n \hat{z}^i_U \Big] \quad \text{for } i = 1, \ldots, r, \qquad (32) \]

and $\hat{z}^i_U$ is the $i$th column of the sub-gradient matrix $\hat{Z}_U$.

With reference to step (C), we obtain the candidate dual solution $\hat{Z}_{U^c} \in \mathbb{R}^{|U^c| \times r}$ as follows.
For each $i = 1, \ldots, r$, let $\Pi_{X^i_U}$ denote the orthogonal projection onto the range of $X^i_U$. Using
the sub-matrix $\hat{Z}_U \in \mathbb{R}^{|U| \times r}$ obtained from step (B), we define column $i$ of the matrix $\hat{Z}_{U^c}$ as
follows:

\[ \hat{z}^i_{U^c} = \frac{1}{\lambda_n n} \big\langle X^i_{U^c}, (I - \Pi_{X^i_U}) w^i \big\rangle + \frac{1}{n} \Big\langle X^i_{U^c},\; X^i_U \Big( \frac{1}{n} \langle X^i_U, X^i_U \rangle \Big)^{-1} \hat{z}^i_U \Big\rangle \quad \text{for } i = 1, \ldots, r. \qquad (33) \]

See the end of Appendix D.2 for derivation of this condition.

Finally, in order to further simplify notation in our proofs, for each $k \in U^c$, we define the
random variable

\[ V_k := \sum_{i=1}^r |\hat{z}^i_k|. \qquad (34) \]

With this notation, the strict dual feasibility condition (30) is equivalent to the event $\mathcal{E}(V) := \{\max_{k \in U^c} V_k < 1\}$.
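
For orientation, the following sketch indicates one way the expression for the dual candidate on $U^c$ can be obtained from the stationarity conditions. It is our reading rather than the authors' appendix derivation, and it assumes that the quadratic loss is normalized by $1/(2n)$ and that $y^i = X^i_U \beta^{*i}_U + w^i$.

```latex
% Stationarity of the block-regularized program at (\hat{B}_U, 0), column i:
%   -(1/n) (X^i)^T (y^i - X^i_U \hat{\beta}^i_U) + \lambda_n \hat{z}^i = 0.
% The rows in U give the expression for \Delta^i_U; the rows in U^c give \hat{z}^i_{U^c}.
\begin{align*}
y^i - X^i_U \hat{\beta}^i_U
  &= w^i - X^i_U \Delta^i_U, \\
X^i_U \Delta^i_U
  &= \Pi_{X^i_U} w^i
     - \lambda_n X^i_U \Big( \tfrac{1}{n} \langle X^i_U, X^i_U \rangle \Big)^{-1} \hat{z}^i_U, \\
\hat{z}^i_{U^c}
  &= \frac{1}{\lambda_n n} \big\langle X^i_{U^c},\, w^i - X^i_U \Delta^i_U \big\rangle \\
  &= \frac{1}{\lambda_n n} \big\langle X^i_{U^c},\, (I - \Pi_{X^i_U}) w^i \big\rangle
     + \frac{1}{n} \Big\langle X^i_{U^c},\,
       X^i_U \Big( \tfrac{1}{n} \langle X^i_U, X^i_U \rangle \Big)^{-1} \hat{z}^i_U \Big\rangle .
\end{align*}
```

The first term depends only on the noise projected away from the range of $X^i_U$, while the second depends only on the design and the subgradient on $U$; the proofs bound these two contributions separately.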

5 Proof of Theorem 1

We begin by establishing a set of sufficient conditions for deterministic design matrices, as
stated in Theorem 1.

5.1 Establishing strict dual feasibility 

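We begin by obtaining control on the probability of the event $\mathcal{E}(V)$, so as to show that step
(C) of the primal-dual witness construction succeeds. Recall that $\Pi_{X^i_U}$ denotes the orthogonal
projection onto the range space of $X^i_U$, and the definition (11) of the incoherence parameter
$\gamma \in (0, 1]$. By the mutual incoherence condition (11), we have

\[ \max_{k \in U^c} \sum_{i=1}^r \Big| \frac{1}{n} \Big\langle X^i_k,\; X^i_U \Big( \frac{1}{n} \langle X^i_U, X^i_U \rangle \Big)^{-1} \hat{z}^i_U \Big\rangle \Big| \le 1 - \gamma, \qquad (35) \]

where we have used the fact that $\sum_{i=1}^r |\hat{z}^i_j| = 1$ for each $j \in U$. Recalling that $V_k = \sum_{i=1}^r |\hat{z}^i_k|$
and using the definition (33), we have by triangle inequality

\[ \mathbb{P}\big[ \max_{k \in U^c} V_k \ge 1 \big] \le \mathbb{P}[A(\gamma)], \]

where we have defined the event

\[ A(\gamma) := \Big\{ \max_{k \in U^c} \sum_{i=1}^r \Big| \frac{1}{\lambda_n n} \big\langle X^i_k, (I - \Pi_{X^i_U}) w^i \big\rangle \Big| \ge \gamma \Big\}. \qquad (36) \]

To analyze this remaining probability, for each index $i = 1, \ldots, r$ and $k \in U^c$, define the
random variable

\[ W^i_k := \frac{1}{\lambda_n n} \big\langle X^i_k, (I - \Pi_{X^i_U}) w^i \big\rangle. \qquad (37) \]

Since the elements of the $n$-vector $w^i$ follow a $N(0, \sigma^2)$ distribution, the variable $W^i_k$ is
zero-mean Gaussian with variance $\frac{\sigma^2}{\lambda_n^2 n^2} \langle X^i_k, (I - \Pi_{X^i_U}) X^i_k \rangle$. Since $\|X^i_k\|_2^2 \le 2n$ by assumption and
$(I - \Pi_{X^i_U})$ is an orthogonal projection matrix, the variance of each $W^i_k$ is upper bounded by
$\frac{2\sigma^2}{\lambda_n^2 n}$. Consequently, for any choice of sign vector $b \in \{-1, +1\}^r$, the variance of the zero-mean
Gaussian $\sum_{i=1}^r b_i W^i_k$ is upper bounded by $\frac{2 r \sigma^2}{\lambda_n^2 n}$.

Therefore, by taking the union bound over all sign vectors and over indices $k \in U^c$,
we have

\[ \mathbb{P}[A(\gamma)] = \mathbb{P}\Big[ \max_{k \in U^c} \max_{b \in \{-1,+1\}^r} \sum_{i=1}^r b_i W^i_k \ge \gamma \Big] \le 2 \exp\Big( -\frac{\gamma^2 \lambda_n^2 n}{4 r \sigma^2} + r + \log p \Big). \]

With the choice $\lambda_n^2 \ge \frac{4 \xi \sigma^2}{\gamma^2} \frac{r^2 + r \log p}{n}$ for some $\xi > 1$, we conclude that

\[ \mathbb{P}[\mathcal{E}(V)] \ge 1 - 2 \exp\big( -(\xi - 1)[r + \log p] \big) \to 1. \]

By Lemma 2(i), this event implies the uniqueness of the solution $\hat{B}$, and moreover the inclusion
of the supports $S(\hat{B}) \subseteq S(B^*)$, as claimed.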
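
The arithmetic behind the choice of regularization parameter can be checked numerically. The sketch below assumes that the union bound takes the form $2\exp(-\gamma^2 \lambda_n^2 n / (4 r \sigma^2) + r + \log p)$ and that the smallest admissible value is $\lambda_n^2 = 4 \xi \sigma^2 (r^2 + r \log p) / (\gamma^2 n)$; these constants are our reconstruction from the variance bound in the argument above, and the function names are hypothetical.

```python
import math


def min_lambda_sq(xi, sigma, gamma, r, p, n):
    """Smallest lambda_n^2 allowed by the assumed condition, for a given xi > 1."""
    return 4.0 * xi * sigma ** 2 * (r ** 2 + r * math.log(p)) / (gamma ** 2 * n)


def tail_exponent(lam_sq, sigma, gamma, r, p, n):
    """Exponent of the assumed union bound on the failure of strict dual feasibility."""
    return -gamma ** 2 * lam_sq * n / (4.0 * r * sigma ** 2) + r + math.log(p)


def dual_feasibility_lower_bound(xi, r, p):
    """Resulting lower bound 1 - 2 exp(-(xi - 1) (r + log p)) on the success probability."""
    return 1.0 - 2.0 * math.exp(-(xi - 1.0) * (r + math.log(p)))
```

Plugging the minimal value of $\lambda_n^2$ into the exponent cancels the dependence on $\sigma$, $\gamma$ and $n$, leaving exactly $-(\xi - 1)(r + \log p)$, which is why the final probability bound involves only $\xi$, $r$ and $p$.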

5.2 Establishing $\ell_\infty$ bounds

We now turn to establishing the claimed $\ell_\infty$-bound (16) on the difference $\hat{B} - B^*$. We have
already shown that this difference is exactly zero for rows in $U^c$; it remains to analyze the
difference $\Delta_U = \hat{B}_U - B^*_U$. It suffices to prove the $\ell_\infty$ bound for the columns $\Delta^i_U$ separately,
for each $i = 1, \ldots, r$.

We split the analysis of the random variable $\max_{k \in U} |\Delta^i_k|$ into two terms, based on the
form of $\Delta$ from equation (32), one involving the observation noise $w^i$, and the other involving
the dual variables $\hat{z}^i_U$, as follows:

\[ \max_{k \in U} |\Delta^i_k| \le \underbrace{\Big\| \Big( \frac{1}{n} \langle X^i_U, X^i_U \rangle \Big)^{-1} \frac{1}{n} \langle X^i_U, w^i \rangle \Big\|_\infty}_{T^i_a} + \underbrace{\Big\| \Big( \frac{1}{n} \langle X^i_U, X^i_U \rangle \Big)^{-1} \lambda_n \hat{z}^i_U \Big\|_\infty}_{T^i_b}. \]

The second term is easy to control: from the characterization of the subdifferential (Lemma 1),
we have $\|\hat{z}^i_U\|_\infty \le 1$, so that $T^i_b \le \lambda_n |\!|\!| (\frac{1}{n} \langle X^i_U, X^i_U \rangle)^{-1} |\!|\!|_\infty \le D_{\max} \lambda_n$.

Turning to the first term $T^i_a$, we note that since $X^i_U$ is fixed, the $|U|$-dimensional random
vector $Y := (\frac{1}{n} \langle X^i_U, X^i_U \rangle)^{-1} \frac{1}{n} \langle X^i_U, w^i \rangle$ is zero-mean Gaussian, with covariance
$\frac{\sigma^2}{n} [\frac{1}{n} \langle X^i_U, X^i_U \rangle]^{-1}$. Therefore, we have $\mathrm{var}(Y_k) \le \frac{\sigma^2}{C_{\min} n}$, and can use this in standard
Gaussian tail bounds. By applying the union bound twice, first over $k \in U$, and then over
$i \in \{1, 2, \ldots, r\}$, we obtain

\[ \mathbb{P}\big[ \max_{i = 1, \ldots, r} T^i_a \ge t \big] \le 2 \exp\big( -t^2 n C_{\min} / (2\sigma^2) + \log(rs) + \log r \big), \]

where we have used the fact that $|U| \le rs$. Setting $t = \xi \sqrt{\frac{4 \sigma^2 \log(rs)}{C_{\min} n}}$ yields that

\[ \max_{i = 1, \ldots, r} \max_{k \in U} |\Delta^i_k| \le \xi \sqrt{\frac{4 \sigma^2 \log(rs)}{C_{\min}\, n}} + D_{\max} \lambda_n =: b_1(\xi, \lambda_n, n, s), \]

with probability greater than $1 - 2 \exp(-(\xi^2 - 1) \log(rs))$, as claimed.

Finally, to establish support recovery, recall that we proved above that $\|\Delta\|_{\ell_\infty/\ell_\infty}$ is bounded by
$b_1(\xi, \lambda_n, n, s)$. Hence, as long as $B^*_{\min} > b_1(\xi, \lambda_n, n, s)$, we are guaranteed that if $\beta^{*i}_k \ne 0$,
then $\hat{\beta}^i_k \ne 0$.

6 Proof of Theorem 2

We now turn to the proof of Theorem 2, providing sufficient conditions for general Gaussian
ensembles. Recall that for $i = 1, 2, \ldots, r$, each $X^i \in \mathbb{R}^{n \times p}$ is a random design matrix, with
rows drawn i.i.d. from a zero-mean Gaussian with $p \times p$ covariance matrix $\Sigma^i$.

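6.1 Establishing strict dual feasibility

Recalling that $V_k = \sum_{i=1}^r |\hat{z}^i_k|$ and using the definition (33), we have the decomposition

\[ \max_{k \in U^c} V_k \le \underbrace{\max_{k \in U^c} \sum_{i=1}^r \Big| \frac{1}{\lambda_n n} \big\langle X^i_k, (I - \Pi_{X^i_U}) w^i \big\rangle \Big|}_{M_1} + \underbrace{\max_{k \in U^c} \sum_{i=1}^r \Big| \frac{1}{n} \Big\langle X^i_k,\; X^i_U \Big( \frac{1}{n} \langle X^i_U, X^i_U \rangle \Big)^{-1} \hat{z}^i_U \Big\rangle \Big|}_{M_2}. \]

In order to show that $\max_{k \in U^c} V_k < 1$ with high probability, we deal with each of these two
terms in turn, showing that $M_1 < \gamma/2$, and $M_2 < 1 - \gamma/2$, both with high probability.

In order to bound $M_1$, we require the following condition on the columns of the design
matrices:

Lemma 3. Let $\sigma_{\max} = \max_{i,j} \Sigma^i_{jj}$. For $n > 2 \log(rp)$, each column of the design matrices
$X^i$, $i = 1, \ldots, r$ has controlled $\ell_2$-norm:

\[ \mathbb{P}\Big[ \max_{i = 1, \ldots, r}\; \max_{k = 1, \ldots, p} \|X^i_k\|_2^2 \ge 2 \sigma_{\max} n \Big] \le 2 \exp\Big( -\frac{n}{2} + \log(pr) \Big) \to 0. \qquad (38) \]

This claim follows immediately by union bound and concentration results for $\chi^2$-variates; in
particular, the bound (66a) in Appendix E.

Under the condition of Lemma 3, each variable $W^i_k := \frac{1}{\lambda_n n} \langle X^i_k, (I - \Pi_{X^i_U}) w^i \rangle$ is zero-mean
Gaussian, with variance at most $\frac{2 \sigma_{\max} \sigma^2}{\lambda_n^2 n}$. Consequently, for any choice of signs $b \in \{-1, +1\}^r$,
the variable $\sum_{i=1}^r b_i W^i_k$ is zero-mean Gaussian, with variance at most $\frac{2 r \sigma_{\max} \sigma^2}{\lambda_n^2 n}$. Therefore, for any
$t > 0$, we have

\[ \mathbb{P}\Big[ \max_{k \in U^c} \sum_{i=1}^r |W^i_k| \ge t \Big] = \mathbb{P}\Big[ \max_{k \in U^c} \max_{b \in \{-1,+1\}^r} \sum_{i=1}^r b_i W^i_k \ge t \Big] \le 2 \exp\Big( -\frac{\lambda_n^2 n}{4 r \sigma_{\max} \sigma^2}\, t^2 + r + \log p \Big). \]

Setting $t = \gamma/2$ yields that

\[ \mathbb{P}[M_1 \ge \gamma/2] \le 2 \exp\Big( -\frac{\lambda_n^2 n}{16 r \sigma_{\max} \sigma^2}\, \gamma^2 + r + \log p \Big). \]

Lemma 4. Suppose that the design covariance matrices $\Sigma^i$, $i = 1, \ldots, r$ satisfy the mutual
incoherence condition (19). Then we have

\[ M_2 \le (1 - \gamma) + \underbrace{\max_{k \in U^c} \sum_{i=1}^r \Big| \frac{1}{n} \Big\langle Y^i_k,\; X^i_U \Big( \frac{1}{n} \langle X^i_U, X^i_U \rangle \Big)^{-1} \hat{z}^i_U \Big\rangle \Big|}_{M_2'}, \qquad (39) \]

where each random vector $Y^i_k \in \mathbb{R}^{n \times 1}$ has i.i.d. $N(0, 1)$ entries, and is independent of $w^i$ and
$X^i_U$.

See Appendix B for the proof of this claim.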



It remains to show that the random variable M' 2 defined in equation (|39p is upper bounded 
by 7/2 with high probability. Conditioning on X\j and w 1 , the scalar random variable 
^(Y" fc \ Xjj((^Xfj, A^)) -1 ?^) is zero-mean Gaussian, with variance upper bounded as 



1^- ,,l "~*" 2 



-{zi;, « Xfj, Xjj))- 1 ^) < 



n n C min n 

Recalling that YH=i \ = 1' f° r an y choice of signs b G {—1, +l} r , the variable 

iZ^Y^XuC-XhiXh))^) 

i=l 

is zero-mean Gaussian, with variance at most 7^ — . Therefore, we have 

V -1 

P[M^> 7 /2] < P[^^^ i} J^^-<^ J ^((-^,jr ir >)- 1 ^>|> 7 /2] 

< 2exp( 7 + r + logp). 

8rs 

This probability vanishes faster than 2exp { — n(r + logp)} — > 0, as long as 

8 k r , s 
n > — ^s{r + logp) 

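6.2 Establishing $\ell_\infty$ bounds

We now turn to establishing the claimed $\ell_\infty$-bound (22) on the difference $\hat{B} - B^*$. As in the
analogous portion of the proof of Theorem 1, we use the decomposition

\[ \max_{k \in U} |\Delta^i_k| \le \underbrace{\Big\| \Big( \frac{1}{n} \langle X^i_U, X^i_U \rangle \Big)^{-1} \frac{1}{n} \langle X^i_U, w^i \rangle \Big\|_\infty}_{T^i_a} + \underbrace{\Big\| \Big( \frac{1}{n} \langle X^i_U, X^i_U \rangle \Big)^{-1} \lambda_n \hat{z}^i_U \Big\|_\infty}_{T^i_b}. \]

In the setting of random design matrices, a bit more work is required to control these terms.
Beginning with the second term, by triangle inequality, we have

\[ T^i_b \le \Big\| \Big[ \Big( \frac{1}{n} \langle X^i_U, X^i_U \rangle \Big)^{-1} - (\Sigma^i_{UU})^{-1} \Big] \lambda_n \hat{z}^i_U \Big\|_\infty + \big\| (\Sigma^i_{UU})^{-1} \lambda_n \hat{z}^i_U \big\|_\infty \]
\[ \phantom{T^i_b} \le \Big|\!\Big|\!\Big| \Big( \frac{1}{n} \langle X^i_U, X^i_U \rangle \Big)^{-1} - (\Sigma^i_{UU})^{-1} \Big|\!\Big|\!\Big|_2\, \lambda_n \sqrt{s} + D_{\max} \lambda_n, \]

where we have used the fact that $\|\hat{z}^i_U\|_2 \le \sqrt{s}$, since $\hat{z}_U$ belongs to the sub-differential of
the block $\ell_1/\ell_\infty$ norm (see Lemma 1), so that $\sum_{i=1}^r |\hat{z}^i_j| \le 1$ for any $j \in U$. By concentration
bounds for eigenvalues of Gaussian random matrices (see equation (69b) in Appendix E), we
conclude that

\[ T^i_b \le 4 \lambda_n \sqrt{s} \sqrt{\frac{s}{n}} + D_{\max} \lambda_n = \lambda_n \Big[ \frac{4s}{\sqrt{n}} + D_{\max} \Big]. \]

Now consider the first term $T^i_a$: if we condition on $X^i_U$, then the $|U|$-dimensional random
vector $Y := (\frac{1}{n} \langle X^i_U, X^i_U \rangle)^{-1} \frac{1}{n} \langle X^i_U, w^i \rangle$ is zero-mean Gaussian, with covariance
$\frac{\sigma^2}{n} [\frac{1}{n} \langle X^i_U, X^i_U \rangle]^{-1}$. By concentration bounds for eigenvalues of Gaussian random matrices
(see equation (69b) in Appendix E), we have

\[ \frac{\sigma^2}{n} \Big|\!\Big|\!\Big| \Big( \frac{1}{n} \langle X^i_U, X^i_U \rangle \Big)^{-1} \Big|\!\Big|\!\Big|_2 \le \frac{\sigma^2}{n} \Big\{ \Big|\!\Big|\!\Big| \Big( \frac{1}{n} \langle X^i_U, X^i_U \rangle \Big)^{-1} - (\Sigma^i_{UU})^{-1} \Big|\!\Big|\!\Big|_2 + \big|\!\big|\!\big| (\Sigma^i_{UU})^{-1} \big|\!\big|\!\big|_2 \Big\} \le \frac{5 \sigma^2}{C_{\min}\, n}, \]

since $rs/n < 1$. Therefore, we have shown that the variance of each element of $Y$ is upper
bounded by $5\sigma^2/(C_{\min} n)$, so that we can apply standard Gaussian tail bounds. By applying the
union bound twice, first over $k \in U$, and then over $i \in \{1, 2, \ldots, r\}$, we obtain

\[ \mathbb{P}\big[ \max_{i = 1, \ldots, r} T^i_a \ge t \big] \le 2 \exp\big( -t^2 n C_{\min} / (50 \sigma^2) + \log |U| + \log r \big). \]

Setting $t = \xi \sqrt{\frac{100 \sigma^2 \log(rs)}{C_{\min} n}}$ yields that

\[ \mathbb{P}\big[ \max_{i = 1, \ldots, r} T^i_a \ge t \big] \le 2 \exp\big\{ -2 \xi^2 \log(rs) + \log(rs) + \log r \big\} \le 2 \exp\big\{ -2 (\xi^2 - 1) \log(rs) \big\}, \]

where we have used the fact that $|U| \le rs$. Combining the pieces, we conclude that

\[ \max_{k \in U} |\Delta^i_k| \le \xi \sqrt{\frac{100 \sigma^2 \log(rs)}{C_{\min}\, n}} + \lambda_n \Big[ \frac{4s}{\sqrt{n}} + D_{\max} \Big], \]

with probability greater than

\[ 1 - 2 \exp\big\{ -2 (\xi^2 - 1) \log(rs) \big\} - c_1 \exp(-c_2 n), \]

as claimed.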

7 Proof of Theorem 3

We now turn to the proof of the phase transition predicted by Theorem 3, which applies to
random design matrices $X^1$ and $X^2$ drawn from the standard Gaussian ensemble. This proof
requires significantly more technical work than the preceding two proofs, since we need to
control all the constants exactly, and to establish both necessary and sufficient conditions on
the sample size.

7.1 Proof of Theorem 3(a)

We begin with the achievability result. Our proof parallels that of Theorems 1 and 2, in that
we first establish strict dual feasibility, and then turn to proving $\ell_\infty$ bounds and exact support
recovery.

7.1.1 Establishing strict dual feasibility

Recalling that $V_k = \sum_{i=1}^2 |\hat{z}^i_k|$, we have

\[ \max_{k \in U^c} |V_k| \le M_1 + M_2, \]

where the random variables $M_1$ and $M_2$ were defined at the start of Section 6.1. In order to
prove that $\max_{k \in U^c} |V_k| < 1$ with high probability for the values of $n$, $s$, and $p$, we will first
establish that $M_1 < \epsilon/2$ and $M_2 < 1 - \epsilon$ for an appropriately chosen value of $\epsilon$.
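By the results from the previous section (specialized to $r = 2$ and $\sigma_{\max} = 1$), we have $M_1 < \epsilon/2$
with high probability, since

\[ \mathbb{P}\Big[ \max_{k \in U^c} \sum_{i=1}^2 |W^i_k| \ge \epsilon/2 \Big] \le 2 \exp\Big( -\frac{\epsilon^2 \lambda_n^2 n}{32 \sigma^2} + 2 + \log p \Big). \]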



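Recall that

\[ M_2 = \max_{k \in U^c} \sum_{i=1}^2 \Big| \frac{1}{n} \Big\langle X^i_k,\; X^i_U \Big( \frac{1}{n} \langle X^i_U, X^i_U \rangle \Big)^{-1} \hat{z}^i_U \Big\rangle \Big| \]

and that $X^i_{U^c}$ is independent of $X^i_U$ and $w^i$. We will show that $M_2 < 1 - \epsilon$ with high
probability by using results on Gaussian extrema. Conditioning on $(X_U, w, \hat{Z}_U)$, the random
variable $Y^i_k := \frac{1}{n} \langle X^i_k, X^i_U (\frac{1}{n} \langle X^i_U, X^i_U \rangle)^{-1} \hat{z}^i_U \rangle$ is zero-mean with variance upper-bounded as

\[ \frac{1}{n} \Big\langle \hat{z}^i_U, \Big( \frac{1}{n} \langle X^i_U, X^i_U \rangle \Big)^{-1} \hat{z}^i_U \Big\rangle \le \Big|\!\Big|\!\Big| \Big( \frac{1}{n} \langle X^i_U, X^i_U \rangle \Big)^{-1} \Big|\!\Big|\!\Big|_2\, \frac{\|\hat{z}^i_U\|_2^2}{n}. \]

Under the given conditioning, the random variables $Y^1_k$ and $Y^2_k$ are independent, and for
any sign vector $b \in \{-1, +1\}^2$, the random variable $\sum_{i=1}^2 b_i Y^i_k$ is Gaussian, zero-mean with
variance upper bounded as

\[ \sum_{i=1}^2 \Big|\!\Big|\!\Big| \Big( \frac{1}{n} \langle X^i_U, X^i_U \rangle \Big)^{-1} \Big|\!\Big|\!\Big|_2\, \frac{\|\hat{z}^i_U\|_2^2}{n}. \]

By Lemma 13, $|\!|\!| (\frac{1}{n} \langle X^i_U, X^i_U \rangle)^{-1} |\!|\!|_2 \le (1 + \delta)$ with probability at least $1 - c_1 \exp(-c_2 n)$ for
sufficiently large $s$ and $n$ under the given scaling for each $i$. Hence, $\sum_{i=1}^2 b_i Y^i_k$ is normal,
zero-mean, with variance upper bounded as

\[ (1 + \delta)\, \frac{\|\hat{z}^1_U\|_2^2 + \|\hat{z}^2_U\|_2^2}{n}. \]

Recall that $\hat{Z}_U$ was obtained from Step (B) of the primal-dual witness construction. The
next lemma provides control over $\sum_{i=1}^2 \|\hat{z}^i_U\|_2^2$.

Lemma 5. Under the assumptions of Theorem 3 and Corollary 1, if $\lambda_n^2 n \to +\infty$ and $s/n \to 0$,
then $\|\hat{z}^1_U\|_2^2 + \|\hat{z}^2_U\|_2^2$ is concentrated: for all $\delta > 0$, we have that for sufficiently large $s$ and $n$

\[ \mathbb{P}\Big[ \|\hat{z}^1_U\|_2^2 + \|\hat{z}^2_U\|_2^2 \le (1 - \delta)\, \frac{s}{2} \Big\{ (4 - 3\alpha) + \frac{1}{\lambda_n^2 s} \|B_{\mathrm{diff}}\|_2^2 \Big\} \Big] \to 0, \quad \text{and} \qquad (40a) \]

\[ \mathbb{P}\Big[ \|\hat{z}^1_U\|_2^2 + \|\hat{z}^2_U\|_2^2 \ge (1 + \delta)\, \frac{s}{2} \Big\{ (4 - 3\alpha) + \frac{1}{\lambda_n^2 s} \|B_{\mathrm{diff}}\|_2^2 \Big\} \Big] \le c_1 \exp(-c_2 n). \qquad (40b) \]

See Appendix C for the proof of this claim.

Now, by applying the union bound and using Gaussian tail bounds, we obtain that the
probability $\mathbb{P}[M_2 \ge 1 - \epsilon]$ is upper bounded by

\[ c_1 \exp(-c_2 n) + 4 \exp\Big( -\frac{(1 - \epsilon)^2\, n}{(1 + \delta)\, s \big\{ (4 - 3\alpha) + \frac{1}{\lambda_n^2 s} \|B_{\mathrm{diff}}\|_2^2 \big\}} + \log(p - (2 - \alpha)s) \Big), \]

which goes to 0 as $n \to \infty$ under the condition

\[ n > \frac{(1 + \delta)\, s \big\{ (4 - 3\alpha) + \frac{1}{\lambda_n^2 s} \|B_{\mathrm{diff}}\|_2^2 \big\}}{(1 - \epsilon)^2}\, \log(p - (2 - \alpha)s). \]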



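7.2 Proof of Theorem 3(b)

We now turn to the proof of the converse claim in Theorem 3. We establish the claim by
contradiction. We show that if a solution $\hat{B}$ exists such that $\hat{B}_{U^c} = 0$, then under the stated
upper bound on the sample size $n$, there exists some $\epsilon > 0$ such that
$\mathbb{P}[\max_{k \in U^c} (|\hat{z}^1_k| + |\hat{z}^2_k|) > 1 + \epsilon]$
converges to one. From the definition (33), we see that conditioned on $(X_U, w, \hat{z}_U)$, the
variables $\{\hat{z}^1_k, k \in U^c\}$ are i.i.d. zero-mean Gaussians, with variance given by

\[ \mathrm{var}(\hat{z}^1_k) := \Big\| \frac{1}{\lambda_n n} \Pi_{U^\perp} w^1 + \frac{1}{n} X^1_U \Big( \frac{1}{n} \langle X^1_U, X^1_U \rangle \Big)^{-1} \hat{z}^1_U \Big\|_2^2. \]

By orthogonality, we have $\mathrm{var}(\hat{z}^1_k) = \| \frac{1}{\lambda_n n} \Pi_{U^\perp} w^1 \|_2^2 + \| \frac{1}{n} X^1_U (\frac{1}{n} \langle X^1_U, X^1_U \rangle)^{-1} \hat{z}^1_U \|_2^2$, so that (using
the idempotency of projection operators), we have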

var(^) > a 2 := max{-L A inin ((-X^)- 1 ) MM) , (41) 

Note that a 2 = <x 2 (X;y, w, Zy) is a scalar random variable, but fixed under the conditioning. 
Turning to the variables {% 2 ,/c £ t/ c }, a similar argument shows that have var(i / 2 ) > a 2 , 
where (Xu,w,z£) is the analogous random variable. 

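For $k \in U^c$, let $\tilde{z}_k^1 \sim N(0, \sigma^2)$ and $\tilde{z}_k^2 \sim N(0, \tilde{\sigma}^2)$. We then have

$$P\big[\max_{k \in U^c} (|z_k^1| + |z_k^2|) > 1 + \epsilon\big] \;\overset{(a)}{\ge}\; P\big[\max_{k \in U^c} (|\tilde{z}_k^1| + |\tilde{z}_k^2|) > 1 + \epsilon\big] \;\ge\; P\big[\max_{k \in U^c} (\tilde{z}_k^1 + \tilde{z}_k^2) > 1 + \epsilon\big] \;\overset{(b)}{=}\; P\big[\max_{k \in U^c} \tilde{Z}_k > 1 + \epsilon\big],$$

where $\tilde{Z}_k \sim N(0, \sigma^2 + \tilde{\sigma}^2)$. Here inequality (a) follows because $\sigma^2$ and $\tilde{\sigma}^2$ are lower bounds on the variances of $\{z_k^1, k \in U^c\}$ and $\{z_k^2, k \in U^c\}$ respectively, and equality (b) follows since $\tilde{z}_k^1$ and $\tilde{z}_k^2$ are independent zero-mean Gaussians with variances $\sigma^2$ and $\tilde{\sigma}^2$, respectively.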

To simplify notation, let $N = |U^c| = p - (2 - \alpha)s$. By standard results for Gaussian maxima [13], for any $\delta > 0$, there exists an integer $N(\delta)$ such that for all $N \ge N(\delta)$,

$$E\big[\max_{k \in U^c} \tilde{Z}_k\big] \ge (1 - \delta) \sqrt{2 (\sigma^2 + \tilde{\sigma}^2) \log N}.$$

Moreover, the maximum function is Lipschitz, so that by Gaussian concentration for Lipschitz functions [12], for any $\eta > 0$, we have

$$P\big[\max_{k \in U^c} \tilde{Z}_k \le E[\max_{k \in U^c} \tilde{Z}_k] - \eta\big] \le \exp\Big( - \frac{\eta^2}{2 (\sigma^2 + \tilde{\sigma}^2)} \Big).$$

Combining these two statements yields that for all $N \ge N(\delta)$, we have

$$P\Big[\max_{k \in U^c} \tilde{Z}_k \le (1 - \delta) \sqrt{2 (\sigma^2 + \tilde{\sigma}^2) \log N} - \eta\Big] \le \exp\Big( - \frac{\eta^2}{2 (\sigma^2 + \tilde{\sigma}^2)} \Big). \tag{42}$$

It remains to show that there exists some $\epsilon > 0$ such that $P[\max_{k \in U^c} \tilde{Z}_k \le 1 + \epsilon]$ converges to zero.
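The Gaussian-maximum scaling invoked above can be checked numerically. The following is a minimal Monte Carlo sketch, not part of the proof: it compares the empirical mean of the maximum of $N$ i.i.d. $N(0, v)$ variables with the reference scale $\sqrt{2 v \log N}$. The function names, the fixed seed, and the chosen values of $N$ and $v$ are illustrative assumptions.

```python
import math
import random


def gaussian_max_scale(num_vars, variance):
    """Reference scale sqrt(2 * v * log N) for the maximum of N i.i.d. N(0, v) draws."""
    return math.sqrt(2.0 * variance * math.log(num_vars))


def mean_gaussian_max(num_vars, variance, trials, seed=0):
    """Monte Carlo estimate of E[max_k Z_k] for Z_k i.i.d. N(0, variance)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    std_dev = math.sqrt(variance)
    total = 0.0
    for _ in range(trials):
        total += max(rng.gauss(0.0, std_dev) for _ in range(num_vars))
    return total / trials


def max_to_scale_ratio(num_vars, variance, trials, seed=0):
    """Ratio of the simulated mean maximum to sqrt(2 * v * log N)."""
    return mean_gaussian_max(num_vars, variance, trials, seed) / gaussian_max_scale(
        num_vars, variance
    )
```

For moderate $N$ the ratio sits noticeably below one and approaches one only slowly as $N$ grows, which is why the lower bound is stated with a factor $(1 - \delta)$ and holds only for $N \ge N(\delta)$.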

Case 1: First suppose that $\lambda_n^2 n = O(1)$. In this case, we have $\sigma^2 = \Omega\big( \|\Pi_{U^\perp} w\|_2^2 / (\lambda_n^2 n^2) \big)$. With probability greater than $1 - c_1 \exp(-c_2 n)$, this quantity is lower bounded by a constant, using concentration for $\chi^2$-variates. In this case, $\sqrt{2 (\sigma^2 + \tilde{\sigma}^2) \log N} - \eta \to +\infty$ w.h.p., so that the result follows trivially.

Case 2: Otherwise, we must have $\lambda_n^2 n \to +\infty$. Under this condition, we now establish a lower bound on $\sigma^2$ that holds with high probability; it will be seen that a similar lower bound holds for $\tilde{\sigma}^2$. We begin by noting the lower bound $\sigma^2 \ge \frac{\|z_U^1\|_2^2}{n} \lambda_{\min}\big( (\tfrac{1}{n} \langle X_U, X_U \rangle)^{-1} \big)$. To control the minimum eigenvalue, define the event

T{Xu) := {Xu | X^Ui-XuXu)- 1 ) > (1 + v^)" 2 }- (43) 

By standard random matrix concentration arguments (see Appendix E), for some fixed $c > 0$, we are guaranteed that $P[T^c(X_U)] \le 2 \exp(-cn)$. Consequently, conditioned on $T(X_U)$, we have

a 2 + ~2 > l|4jji + J!Mi (l + v ^)~2. (44) 

n 

From Lemma 5, we note that if $s/n = o(1)$, then for any $\delta > 0$, we have the lower bound

a 2 + ~2 > (1 _ 5) ^_| (4 _ 3a) + (A(j B )An)) 2} (1 _ o(1)) _ (45) 

The following result is the final step in the proof of Theorem 3(b).

Lemma 6. Suppose that $\lambda_n^2 n \to +\infty$. Under this condition:

(a) If $s/n = \Omega(1)$, then $P[\max_{k \in U^c} \tilde{Z}_k \le 2] \to 0$.

(b) If $s/n \to 0$, then there exists some $\epsilon > 0$ such that $P[\max_{k \in U^c} \tilde{Z}_k \le 1 + \epsilon] \to 0$.

Proof. (a) If $s/n$ is bounded below by some constant $c > 0$, then we have

$$\sigma^2 \ge (1 - \alpha) \frac{s}{n} \ge (1 - \alpha) c,$$

which implies that $(\sigma^2 + \tilde{\sigma}^2) \log N \to +\infty$. Thus, setting $\delta = 1/4$ and $\eta = \tfrac{1}{2} \sqrt{2 (\sigma^2 + \tilde{\sigma}^2) \log N}$ in equation (42) yields that (for $N$ sufficiently large):

$$P\Big[\max_{k \in U^c} \tilde{Z}_k \le \big(\tfrac{1}{2} - \delta\big) \sqrt{2 (\sigma^2 + \tilde{\sigma}^2) \log N}\Big] = P\Big[\max_{k \in U^c} \tilde{Z}_k \le \tfrac{1}{4} \sqrt{2 (\sigma^2 + \tilde{\sigma}^2) \log N}\Big] \le \exp\Big( - \frac{\log N}{4} \Big) \to 0.$$

Since $\tfrac{1}{4} \sqrt{2 (\sigma^2 + \tilde{\sigma}^2) \log N} \ge 2$ for $N$ large enough, the claim follows.

(b) In this case, we may apply the lower bound (45), so that, for any $\delta > 0$, we have
o* + a* > (l-5)J-{(4-3a) + (A(S,A n ))}(l-o(l)) 



with high probability. Since n < (1 — v) [(4 — 3a) + (A(£?, A n ))]s log Af by assumption, we have 



Consequently, from equation (42), for any $\eta > 0$ and $\delta > 0$, we have for all $N \ge N(\delta)$,



Since $\nu > 0$, we may choose $\eta, \delta > 0$ sufficiently small so that for sufficiently large choices of $(s, n)$, we have



for some $\epsilon > 0$. Since from Lemma 5 the condition $s/n = o(1)$ implies that $\sigma^2 + \tilde{\sigma}^2 = o(1)$ w.h.p., we thus conclude that, using these choices of $\eta$ and $\delta$, we have

$$P\big[\max_{k \in U^c} \tilde{Z}_k \le 1 + \epsilon\big] \to 0, \tag{46}$$

as claimed. □

8 Discussion

In this paper, we provided a number of theoretical results that sharply characterize when, and if so by how much, the use of block $\ell_1/\ell_\infty$ regularization actually leads to improvements in statistical efficiency in the problem of multivariate regression. As suggested in a body of past work, the use of block $\ell_1/\ell_\infty$ regularization is well-motivated in many application contexts. However, since it involves greater computational cost than more naive approaches, the question of whether this greater computational price yields statistical gains is an important one.

This paper assessed statistical efficiency in terms of the number of samples required to recover the support exactly; however, one could imagine studying the same issue for related loss functions (e.g., $\ell_2$-loss or prediction loss), and it would be interesting to see whether the results were qualitatively similar. Our results demonstrate that some care needs to be exercised in the application of $\ell_1/\ell_\infty$ regularization. Indeed, it can yield improved statistical efficiency when the regression matrix exhibits structured sparsity, with high overlaps among the sets of active coefficients within each column. However, our analysis shows that these improvements are quite sensitive to the exact structure of the regression matrix, and how well it aligns with the regularizing norm. When this alignment is not high enough, the use of $\ell_1/\ell_\infty$ regularization can actually impair performance relative to more naive (and less computationally intensive) schemes based on $\ell_1$-regularization, such as the Lasso. Moreover, whether or not $\ell_1/\ell_\infty$ regularization yields statistical improvements is very sensitive to the actual magnitudes of the different regression problems. In comparison to related results obtained by Obozinski et al. [23] on block $\ell_1/\ell_2$ regularization, the block $\ell_1/\ell_\infty$ regularization exhibits some fragility, in that the conditions under which it actually improves statistical efficiency are delicate and easily violated. An interesting open direction is to study whether or not it is possible to develop computationally efficient methods that are fully adaptive to the sparsity overlap, namely, methods that behave like ordinary $\ell_1$-regularization when there is little or no shared sparsity, and behave like block regularization schemes in the presence of shared sparsity.
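To make the comparison with the Lasso concrete, the following minimal Python sketch evaluates the rescaled sample size $\theta_{1,\infty}(n, p, s, \alpha) = n / \{(4 - 3\alpha) s \log(p - (2 - \alpha)s)\}$ stated in the abstract, together with the sample size at which it equals one. All function names are illustrative. The Lasso threshold $2 s \log(p - s)$ used for comparison is an assumption taken from the standard sharp-threshold result for the ordinary Lasso; it is consistent with the $\alpha = 2/3$ crossover mentioned in the abstract but is not derived in this section.

```python
import math


def block_threshold_samples(p, s, alpha):
    """Sample size at which theta_{1,inf} equals one: (4 - 3a) * s * log(p - (2 - a) * s)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (4.0 - 3.0 * alpha) * s * math.log(p - (2.0 - alpha) * s)


def rescaled_sample_size(n, p, s, alpha):
    """Rescaled sample size theta_{1,inf}(n, p, s, alpha) from the abstract."""
    return n / block_threshold_samples(p, s, alpha)


def lasso_threshold_samples(p, s):
    """Assumed Lasso analogue, 2 * s * log(p - s), applied to each problem separately."""
    return 2.0 * s * math.log(p - s)
```

With `p = 1000` and `s = 20`, the block threshold is below the assumed Lasso threshold for full overlap (`alpha = 1.0`) and above it for disjoint supports (`alpha = 0.0`), mirroring the qualitative conclusion of the discussion.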



A Recovering individual signed supports 

In this appendix, we discuss some issues associated with recovering individual signed supports. We begin by observing that once the support union $U$ has been recovered, one can restrict the regression problem to this subset $U$, and then apply the Lasso to each problem separately (with substantially lower cost, since each problem is now low-dimensional) in order to recover the individual signed supports. If one is not willing to perform some extra computation in this way, then the interpretation of Theorems 1 and 2 in terms of recovering the individual signed supports requires a more delicate treatment, which we discuss in this appendix.

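Interestingly, the structure of the block $\ell_1/\ell_\infty$ norm permits two ways in which to recover the individual signed supports.

$\ell_1/\ell_\infty$ primal recovery: Solve the block-regularized program (6), thereby obtaining a (primal) optimal solution $\widehat{B} \in \mathbb{R}^{p \times r}$. Estimate the support union via $\widehat{U} := \bigcup_{i=1,\ldots,r} S(\widehat{\beta}^i)$, and estimate the signed support vectors via

$$[\mathbb{S}_{\mathrm{pri}}(\widehat{\beta}^i)]_k := \mathrm{sign}(\widehat{\beta}_k^i). \tag{47}$$

$\ell_1/\ell_\infty$ dual recovery: Solve the block-regularized program (6), thereby obtaining a primal solution $\widehat{B} \in \mathbb{R}^{p \times r}$. For each row $k = 1, \ldots, p$, compute the set $\mathbb{M}_k := \arg\max_{i=1,\ldots,r} |\widehat{\beta}_k^i|$. Estimate the support union via $\widehat{U} = \bigcup_{i=1,\ldots,r} S(\widehat{\beta}^i)$, and estimate the signed support vectors via

$$[\mathbb{S}_{\mathrm{dua}}(\widehat{\beta}^i)]_k = \begin{cases} \mathrm{sign}(\widehat{\beta}_k^i) & \text{if } i \in \mathbb{M}_k, \\ 0 & \text{otherwise.} \end{cases} \tag{48}$$

The procedure (48) corresponds to estimating the signed support on the basis of a dual optimal solution associated with the optimal primal solution.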

The dual signed support recovery method (48) is more conservative in estimating the individual support sets. In particular, for any given $i \in \{1, \ldots, r\}$, it only allows an index $k$ to enter the signed support estimate $\mathbb{S}_{\mathrm{dua}}(\widehat{\beta}^i)$ when $|\widehat{\beta}_k^i|$ achieves the maximum magnitude (possibly non-unique) across all indices $i = 1, \ldots, r$. Consequently, unlike the primal estimator (47), a corollary of Theorem 1 guarantees that the dual signed support method (48) never suffers from false inclusions in the signed support set. On the other hand, unlike the primal estimator, it may incorrectly exclude indices of some supports; that is, it may exhibit false exclusions.

To provide a concrete illustration of this distinction, suppose that $p = 4$ and $r = 3$, and that the true matrix $B$ and estimate $\widehat{B}$ take the following form:

$$B = \begin{bmatrix} 2 & 0 & -3 \\ 2 & 4 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} \quad \text{and} \quad \widehat{B} = \begin{bmatrix} 1.9 & 0.1 & -2.9 \\ 1.7 & 3.9 & -0.1 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}.$$

Consistent with the claims of Theorem 1, the estimate $\widehat{B}$ correctly recovers the support union, viz. $S(\widehat{B}) = U = \{1, 2\} = S(B)$. The primal (47) and dual (48) methods return the following estimates of the individual signed supports:

$$\mathbb{S}_{\mathrm{pri}}(\widehat{B}) = \begin{bmatrix} 1 & 1 & -1 \\ 1 & 1 & -1 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} \quad \text{and} \quad \mathbb{S}_{\mathrm{dua}}(\widehat{B}) = \begin{bmatrix} 0 & 0 & -1 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}.$$

Consequently, the primal estimate includes false non-zeros in positions (1,2) and (2,3), whereas the dual estimate includes false zeros in positions (1,1) and (2,1).
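The two recovery rules (47) and (48) are simple enough to state in code. The sketch below is a minimal illustration applied to the estimate from the example above, with rows 3 and 4 taken to be zero (an assumption consistent with the stated support union $\{1, 2\}$). The function names and the optional tie tolerance `tol` are illustrative and not part of the original procedure.

```python
def sign(x):
    """Return +1, -1, or 0 according to the sign of x."""
    return (x > 0) - (x < 0)


def primal_signed_support(b_hat):
    """Rule (47): entrywise sign of the estimate."""
    return [[sign(v) for v in row] for row in b_hat]


def dual_signed_support(b_hat, tol=0.0):
    """Rule (48): keep the sign only where |entry| attains the row maximum."""
    result = []
    for row in b_hat:
        row_max = max(abs(v) for v in row)
        result.append(
            [sign(v) if row_max > 0 and abs(v) >= row_max - tol else 0 for v in row]
        )
    return result


# Estimate from the worked example; rows are indices k, columns are problems i.
B_HAT = [
    [1.9, 0.1, -2.9],
    [1.7, 3.9, -0.1],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
```

Calling `primal_signed_support(B_HAT)` and `dual_signed_support(B_HAT)` reproduces the two sign matrices of the example: the primal rule marks every non-zero entry, while the dual rule keeps only the row-wise maximizers.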

We note that it is possible to ensure that, under some conditions, the dual support method (48) will correctly recover each of the individual signed supports, without any incorrect exclusions. However, as illustrated by Theorem 3 and Corollary 1, doing so requires additional assumptions on the size of the gap $|\beta_k^i| - |\beta_k^j|$ for indices $k \in B := S(\beta^i) \cap S(\beta^j)$.



B Proof of Lemma 4

Note that, conditioned on $X_U$, the rows of the random matrix $X_{U^c}$ are i.i.d. Gaussian random
vectors with mean <|S^ 7C[/ (S[ /C/ )~ 1 , Xfj} and covariance 

— (Xlfc, Xjj({~XIj, X\j)) 1 ) = S l c/c[/ (S[ / ; 7 ) l z{j + —(Yjjc, Xu((— Xjj, X\j)) 1 ) 

71/ Tt Tli Tli 

where Y£ c ~ iV(0, E^). 

Using these expressions and triangle inequality, we obtain that M is upper bounded by 

Applying the mutual incoherence assumption (|19() . we obtain 

M < (l- 7 )+max^|i(^,^((i^,X^))- 1 ^>|, 



as claimed.

C Proof of Lemma 5

Recall that $Z_U = (z_{B^c}, z_B)$, $\|z_{B^c}^i\|_2^2 = (1 - \alpha)s$, and that $B$ is the set where $|\widehat{\beta}_B^1| = |\widehat{\beta}_B^2|$. Thus, the claim is equivalent to showing that $\|z_B^1\|_2^2 + \|z_B^2\|_2^2$ is concentrated. If $\alpha = 0$, then the claim is trivial, so that we may assume that $\alpha > 0$.
Recall that

$$S^1 z_B^1 = \frac{1}{\lambda_n} \big\{ M^2 [M^1 + M^2]^{-1} f^1 - M^1 [M^2 + M^1]^{-1} f^2 \big\} + M^1 [M^1 + M^2]^{-1} \vec{1} - \frac{1}{\lambda_n} \big[ (M^1)^{-1} + (M^2)^{-1} \big]^{-1} B_{\mathrm{diff}}. \tag{49}$$

Using $|\!|\!| \cdot |\!|\!|_2$ to denote the spectral norm, we first claim that as long as $s/n \to 0$, the following events hold with probability greater than $1 - c_1 \exp(-c_2 n)$:

$$|\!|\!| M^1 - I |\!|\!|_2 = O(\sqrt{s/n}), \tag{50a}$$
$$|\!|\!| [(M^1)^{-1} + (M^2)^{-1}]^{-1} - I/2 |\!|\!|_2 = O(\sqrt{s/n}), \quad \text{and} \tag{50b}$$
$$|\!|\!| M^1 [M^1 + M^2]^{-1} - I/2 |\!|\!|_2 = O(\sqrt{s/n}), \tag{50c}$$

as well as the analogous events with $M^1$ and $M^2$ interchanged.

To verify the bound (50a), we first diagonalize the projection matrix. All of its eigenvalues are $0$ or $1$, and it has rank $(n - s)$ w.p. one, so that we may write $\Pi_{(B^c)^\perp} = U^T D U$ for some orthogonal matrix $U$ and the diagonal matrix $D = \mathrm{diag}\{1_{n-s}, 0_s\}$, whence

$$M = n^{-1} X_B^T U^T D U X_B.$$

But the projection $\Pi_{(B^c)^\perp}$ is independent of $X_B$, which implies that the random rotation matrix $U$ is independent of $X_B$, and hence $X_B \overset{d}{=} U X_B$. Since $D$ is diagonal with $(n - s)$ ones and $s$ zeros, $M \overset{d}{=} n^{-1} W^T W$, where $W \in \mathbb{R}^{(n-s) \times |B|}$ is a standard Gaussian random matrix.

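Consequently, we have

$$|\!|\!| M - I |\!|\!|_2 = |\!|\!| n^{-1} W^T W - I |\!|\!|_2 \le \frac{s}{n} \Big|\!\Big|\!\Big| \frac{1}{n - s} W^T W \Big|\!\Big|\!\Big|_2 + \Big|\!\Big|\!\Big| \frac{1}{n - s} W^T W - I \Big|\!\Big|\!\Big|_2 = O(\sqrt{s/n}),$$

since $|\!|\!| W^T W / (n - s) |\!|\!|_2 = O(1)$, and

$$\Big|\!\Big|\!\Big| \frac{1}{n - s} W^T W - I \Big|\!\Big|\!\Big|_2 = O\Big( \sqrt{\frac{s}{n - s}} \Big) = O(\sqrt{s/n}),$$

using concentration arguments for random matrices (see Lemma 13 in Appendix E).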

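For (50b), we may use the triangle inequality and the submultiplicativity of the norm so that

$$|\!|\!| [(M^1)^{-1} + (M^2)^{-1}]^{-1} - I/2 |\!|\!|_2 = |\!|\!| [(M^1)^{-1} + (M^2)^{-1}]^{-1} \big( I - [(M^1)^{-1} + (M^2)^{-1}]/2 \big) |\!|\!|_2$$
$$\le |\!|\!| [(M^1)^{-1} + (M^2)^{-1}]^{-1} |\!|\!|_2 \; |\!|\!| I - [(M^1)^{-1} + (M^2)^{-1}]/2 |\!|\!|_2$$
$$\le \big\{ |\!|\!| I/2 - (M^1)^{-1}/2 |\!|\!|_2 + |\!|\!| I/2 - (M^2)^{-1}/2 |\!|\!|_2 \big\} \, |\!|\!| [(M^1)^{-1} + (M^2)^{-1}]^{-1} |\!|\!|_2$$
$$= |\!|\!| [(M^1)^{-1} + (M^2)^{-1}]^{-1} |\!|\!|_2 \; O(\sqrt{s/n}).$$

Finally, since $|\!|\!| [(M^1)^{-1} + (M^2)^{-1}]^{-1} |\!|\!|_2 = O(1)$, equation (50b) is valid.
In order to establish the bound (50c), we have

$$|\!|\!| M^1 [M^1 + M^2]^{-1} - I/2 |\!|\!|_2 = |\!|\!| (M^1/2 - M^2/2) [M^1 + M^2]^{-1} |\!|\!|_2$$
$$\le \frac{1}{2} \big\{ |\!|\!| M^1 - I |\!|\!|_2 + |\!|\!| M^2 - I |\!|\!|_2 \big\} \, |\!|\!| [M^1 + M^2]^{-1} |\!|\!|_2$$
$$= |\!|\!| [M^1 + M^2]^{-1} |\!|\!|_2 \; O(\sqrt{s/n}).$$

Since $|\!|\!| [M^1 + M^2] - 2I |\!|\!|_2 = O(\sqrt{s/n}) \to 0$, we have $|\!|\!| [M^1 + M^2]^{-1} |\!|\!|_2 = O(1)$, which establishes the claim (50c).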

We are now ready to establish the claims of the lemma. From the representation (|49p . we 
apply triangle inequality and our bounds on spectral norms, thereby obtaining 



Kill + Kill < 1/ II5 - 2^(1/311 - VhfiWl + II5 + ^(l/ll - l/TO + 2IM 



s filial -i«iiii+>i>} 

with probability greater than 1 — c\ exp(— c 2 n), where r = z B — |(1 — j-d/S^I — |/3^|)). By 
the decomposition of z B in equation (|49j) and applying bounds (|50p 



||r[| 2 < ^{0(7^) + ^ 0( V / ^)lll/3l|-|/3lll| 2 + 7 -t r (l + 0(^)) [II/II2 + H/H2]} 

Since s/n = o(l), in order to establish the upper bound (|40bp it suffices to show that ||/|| 2 + 
| 2 = o(y / sA n ) w.h.p. Similarly, in the other direction, we have 



a 112 



-^ + ll*jlll > vWf + 2^alll^l -I^IIII-^W 



Following the same line of reasoning, in order to prove the lower bound (|40ap . it suffices to 
show that H/H2 + H/IJ2 = o(y/s\ n ) w.h.p. 

Since $\|f^1\|_2$ and $\|f^2\|_2$ behave similarly, it suffices to show that $\|f^1\|_2 = o(\lambda_n \sqrt{s})$. From the definition (55a), we see that conditioned on $(X_{B^c}, w, z_{B^c})$, the random vector $f^1$ is zero-mean Gaussian, with i.i.d. elements with variance

$$\sigma^2 := \frac{\lambda_n^2}{n} (z_{B^c})^T \big( X_{B^c}^T X_{B^c} / n \big)^{-1} z_{B^c} + \frac{1}{n^2} w^T \Pi_{(B^c)^\perp} w.$$

Recalling that $\|z_{B^c}\|_2^2 = (1 - \alpha)s$, we have

$$\sigma^2 \le \frac{\lambda_n^2 (1 - \alpha) s}{n} \, \lambda_{\max}\big( (X_{B^c}^T X_{B^c} / n)^{-1} \big) + \frac{1}{n} \, \frac{\|\Pi_{(B^c)^\perp} w\|_2^2}{n}.$$

By random matrix concentration (see the discussion following Lemma 13 in Appendix E), we have $\lambda_{\max}((X_{B^c}^T X_{B^c} / n)^{-1}) \le 1 + O(\sqrt{s/n})$ w.h.p., and by $\chi^2$ tail bounds (see Lemma 12 in Appendix E), we have $\|\Pi_{(B^c)^\perp} w\|_2^2 / n = O(1)$ w.h.p. Consequently, with high probability, we have $\sigma^2 = O\big( \frac{\lambda_n^2 s}{n} + \frac{1}{n} \big)$. Since the Gaussian random vector $f^1$ has length $|B| = O(s)$, again by concentration for $\chi^2$ random variables, we have (with probability greater than $1 - c_1 \exp(-c_2 s)$) $\|f^1\|_2^2 = O(\sigma^2 s)$. Combining the pieces, we conclude that w.h.p.

$$\|f^1\|_2^2 = O\Big( \frac{\lambda_n^2 s^2}{n} + \frac{s}{n} \Big) = O\Big( \lambda_n^2 s \Big[ \frac{s}{n} + \frac{1}{\lambda_n^2 n} \Big] \Big) = o(\lambda_n^2 s),$$

where the final equality follows since $s/n = o(1)$ and $1/(\lambda_n^2 n) = o(1)$.



D Convex-analytic characterization of optimal solutions

This section is devoted to the development of various properties of the optimal solution(s) of the block $\ell_1/\ell_\infty$-regularized problem (6).

D.1 Basic optimality conditions

By standard conditions for optimality in convex programs [26], the zero-vector must belong to the subdifferential of the objective function in the convex program (6), or equivalently, we must have for each $i = 1, 2, \ldots, r$

$$\frac{1}{n} \langle X^i, X^i \rangle \widehat{\beta}^i - \frac{1}{n} (X^i)^T y^i + \lambda_n z^i = 0, \tag{51}$$

where $Z \in \mathbb{R}^{p \times r}$ must be an element of the subdifferential $\partial \|\widehat{B}\|_{\ell_1/\ell_\infty}$. Substituting the relation $y^i = X^i \beta^i + w^i$, we obtain

$$\frac{1}{n} \langle X^i, X^i \rangle (\widehat{\beta}^i - \beta^i) - \frac{1}{n} (X^i)^T w^i + \lambda_n z^i = 0. \tag{52}$$
D.2 Proof of Lemma 2

We begin with the proof of part (i): suppose that steps (A) through (C) of the primal-dual witness construction succeed. By definition, it outputs a primal pair of the form $(\widehat{B}_U, 0)$, along with a candidate dual optimal solution $(Z_U, Z_{U^c})$. Note that the conditions defining the $\ell_1/\ell_\infty$ subdifferential apply in an elementwise manner, to each index $i = 1, \ldots, p$. Since the sub-vector $Z_U$ was chosen from the subdifferential of the restricted optimal solution, it is dual feasible. Moreover, since the strict dual feasibility condition (30) holds, the matrix $Z_{U^c}$ constructed in step (C) is dual feasible for the zero-solution in the sub-block $U^c$. Therefore, we conclude that $(\widehat{B}_U, 0)$ is a primal optimal solution for the full block-regularized program (6).
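It remains to establish uniqueness of this solution. Define the ball

$$\mathbb{X} := \Big\{ Z \in \mathbb{R}^{p \times r} \;\Big|\; \sum_{i=1}^{r} |z_k^i| \le 1 \quad \forall k = 1, \ldots, p \Big\},$$

and observe that we have the variational representation

$$\|B\|_{\ell_1/\ell_\infty} = \sup_{Z \in \mathbb{X}} \langle\!\langle Z, B \rangle\!\rangle,$$

where $\langle\!\langle \cdot, \cdot \rangle\!\rangle$ denotes the Euclidean inner product. With this notation, the block-regularized program (6) is equivalent to the saddle-point problem

$$\inf_{B \in \mathbb{R}^{p \times r}} \sup_{Z \in \mathbb{X}} \Big\{ \frac{1}{2n} \sum_{i=1}^{r} \|y^i - X^i \beta^i\|_2^2 + \lambda_n \langle\!\langle Z, B \rangle\!\rangle \Big\}.$$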
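The variational representation of the block norm can be illustrated directly. The sketch below is a minimal example under stated assumptions: matrices are plain nested lists with rows indexed by $k$ and columns by the problem index $i$, and the function names are hypothetical. It computes $\|B\|_{\ell_1/\ell_\infty}$ as the sum of row-wise maximum magnitudes and constructs one feasible $Z$ (each row has $\ell_1$ norm at most one) that attains the supremum.

```python
def block_l1_linf_norm(b):
    """Sum over rows of the maximum absolute entry in that row."""
    return sum(max(abs(v) for v in row) for row in b)


def inner_product(z, b):
    """Euclidean (entrywise) inner product of two matrices of equal shape."""
    return sum(zv * bv for z_row, b_row in zip(z, b) for zv, bv in zip(z_row, b_row))


def maximizing_dual(b):
    """One feasible Z attaining the supremum: the sign at a row-wise maximizer, zero elsewhere."""
    z = []
    for row in b:
        j = max(range(len(row)), key=lambda i: abs(row[i]))
        z_row = [0.0] * len(row)
        if row[j] != 0:
            z_row[j] = 1.0 if row[j] > 0 else -1.0
        z.append(z_row)
    return z


def is_feasible(z, tol=1e-12):
    """Check the constraint sum_i |z_k^i| <= 1 for every row k."""
    return all(sum(abs(v) for v in row) <= 1.0 + tol for row in z)
```

Any other feasible $Z$, for example one that spreads weight uniformly across a row, yields an inner product no larger than the norm, which is the content of the supremum.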



Since this saddle-point problem is strictly feasible and convex-concave, it has a value. Moreover, given any dual optimal solution (in particular, $Z$ from the primal-dual construction), any optimal primal solution $\widehat{B}$ must satisfy the saddle point condition

$$\|\widehat{B}\|_{\ell_1/\ell_\infty} = \sup_{Z' \in \mathbb{X}} \langle\!\langle Z', \widehat{B} \rangle\!\rangle = \langle\!\langle Z, \widehat{B} \rangle\!\rangle.$$

But this condition can only hold if, for all $i \in \{1, 2, \ldots, r\}$, $\widehat{\beta}_k^i = 0$ for any index $k \in \{1, \ldots, p\}$ such that $\sum_{i=1}^{r} |z_k^i| < 1$. Therefore, any optimal primal solution must satisfy $\widehat{B}_{U^c} = 0$, so that solving the original program (6) is equivalent to solving the restricted program (29). Lastly, if the matrices $\langle X_U^i, X_U^i \rangle$ are invertible for each $i \in \{1, 2, \ldots, r\}$, then the restricted problem (29) is strictly convex, and so has a unique solution, thereby completing the proof of Lemma 2(i).

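We now prove part (ii) of Lemma 2. Suppose that we are given an estimate $\widehat{B}$ of the true parameters $B$ by solving the convex program (6) such that $\widehat{B}_{U^c} = 0$.

Since $\widehat{B}$ is an optimal solution to the convex program (6), the optimality conditions of equation (52) must be satisfied. We may rewrite those conditions as

$$\frac{1}{n} \langle X_U^i, X^i \rangle \Delta^i - \frac{1}{n} (X_U^i)^T w^i + \lambda_n z_U^i = 0,$$
$$\frac{1}{n} \langle X_{U^c}^i, X^i \rangle \Delta^i - \frac{1}{n} (X_{U^c}^i)^T w^i + \lambda_n z_{U^c}^i = 0,$$

where $\Delta^i = \widehat{\beta}^i - \beta^i$. Recalling that $B_{U^c} = \widehat{B}_{U^c} = 0$, we obtain

$$\frac{1}{n} \langle X_U^i, X_U^i \rangle \Delta_U^i - \frac{1}{n} (X_U^i)^T w^i + \lambda_n z_U^i = 0, \quad \text{and} \tag{53a}$$
$$\frac{1}{n} \langle X_{U^c}^i, X_U^i \rangle \Delta_U^i - \frac{1}{n} (X_{U^c}^i)^T w^i + \lambda_n z_{U^c}^i = 0. \tag{53b}$$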

Again, by standard conditions for optimality in convex programs [21 [8], the first of these two 
equations is exactly the condition that must be satisfied by an optimal solution of the restricted 
program (I29p . However, we have already shown that the candidate solution Bjj satisfies this 
condition, so that it must also be an optimal solution of the convex program (|29p . Additionally, 
the value of Zjj that satisfies equation (j53aH for each i £ {1, 2, . . . , r} is an element of 3 1 1 B 1 1 \ . 
We have thus shown that steps (B) and (C) of the primal-witness construction succeed. It 
remains to establish uniqueness in part (A). However, we note that \Xjj, X\j\ is invertible 
for each i. Hence, for any solution B such that Bjjc = 0, 



*b = ^(xb,xb)r 1 



is well-defined and unique, noting that $\Delta_{U^c}^i = 0$. Thus, we have established the equality (32) and that $\widehat{B}_U$ is unique. Therefore, $\widehat{B}$ gives solutions to steps (A) and (B) when solving the restricted convex program over the set $U$.

Finally, we derive the form of the dual solution $Z_{U^c}$ as a function of $\langle X_U^i, X_U^i \rangle$, $Z_U$, and $\widehat{B} - B$. Recall that $\langle X_U^i, X_U^i \rangle$ is invertible, $Z_U$ is an element of the subdifferential $\partial \|\widehat{B}_U\|_{\ell_1/\ell_\infty}$, and $B_{U^c} = \widehat{B}_{U^c} = 0$. From equation (32), we have

4 C = ^-{4, (/-n^ + V^.^;^^^)"^) fori = l,...,r. (54) 



The claimed form of the dual solution follows by substituting equation f|32|) into equa- 
tion (l53bj) . 

D.3 Subgradients on the support 

In this section, we focus on the specific form of the dual variables $z_U^i$. Our approach is to construct a candidate set of dual variables, and then show that they are valid. We begin by defining the set $B = S(\beta^1) \cap S(\beta^2)$, corresponding to the intersection of the supports, and the set $B^c = U \setminus B$, corresponding to elements in one (but not both) of the supports. For $i = 1, 2$, we let $S^i \in \mathbb{R}^{\alpha s \times \alpha s}$ be a diagonal matrix whose diagonal entries correspond to $\mathrm{sign}(\beta_B^i)$. In addition, we define the vectors $f^i \in \mathbb{R}^{\alpha s}$ and matrices $M^i \in \mathbb{R}^{\alpha s \times \alpha s}$ via



f := S l 



(55a) 



$$M^i := \frac{1}{n} S^i \big\langle X_B^i, \, (I - \Pi_{B^c}) X_B^i \big\rangle S^i. \tag{55b}$$
Given these definitions, we have the following lemma: 

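Lemma 7. Assume that $r = 2$, and that $|\widehat{\beta}_B^1| = |\widehat{\beta}_B^2|$. If $B_{U^c} = \widehat{B}_{U^c} = 0$, then the dual variable $z^1$ satisfies the relation

$$S^1 z_B^1 = \frac{1}{\lambda_n} \big\{ M^2 [M^1 + M^2]^{-1} f^1 - M^1 [M^2 + M^1]^{-1} f^2 \big\} + M^1 [M^1 + M^2]^{-1} \vec{1} - \frac{1}{\lambda_n} \big[ (M^1)^{-1} + (M^2)^{-1} \big]^{-1} B_{\mathrm{diff}} \tag{56}$$

and $z_{B^c}^1 = \mathbb{S}_{\pm}(\beta_{B^c}^1)$, with analogous results holding for $z^2$.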

Given these forms for $S^1 z_B^1$ and $S^2 z_B^2$, it remains to show that the relation $S^1 z_B^1 + S^2 z_B^2 = \vec{1}$ holds under the conditions of Theorem 3(a). Intuitively, this condition should hold since under the conditions of Theorem 3(a), the matrix $M^i$ is approximately the identity, and the vector $f^i$ is approaching $0$. Finally, we expect that $B_{\mathrm{diff}} := S^1 \beta_B^1 - S^2 \beta_B^2$ is very small, hence the final term is also very small. Therefore, on the set $B$, both $S^1 z_B^1$ and $S^2 z_B^2$ are approximately equal to $\tfrac{1}{2}$. We formalize this rough intuition in the following lemma:

Lemma 8. Under the assumptions of Theorem 3(a), each of the following conditions holds for sufficiently large $n$, $s$, and $p$ with probability greater than $1 - c_1 \exp(-c_2 n)$:

$$\Big\| \frac{1}{\lambda_n} \big[ (M^1)^{-1} + (M^2)^{-1} \big]^{-1} B_{\mathrm{diff}} \Big\|_\infty \le \epsilon, \tag{57a}$$
$$\Big\| \frac{1}{\lambda_n} \big\{ M^2 [M^1 + M^2]^{-1} f^1 - M^1 [M^2 + M^1]^{-1} f^2 \big\} \Big\|_\infty \le \epsilon, \tag{57b}$$
$$\Big\| M^1 [M^1 + M^2]^{-1} \vec{1} - \tfrac{1}{2} \vec{1} \Big\|_\infty \le \tfrac{1}{2} - 3\epsilon. \tag{57c}$$

Given Lemmas 7 and 8, we can conclude that the definition for the dual variables on the support is valid. The remaining subsections in this appendix are dedicated to verifying the above results: in particular, we prove Lemma 7 in Appendix D.4 and Lemma 8 in Appendix D.5.

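D.4 Proof of Lemma 7

We now proceed to establish the validity of the closed form expressions for $z_U^1$ and $z_U^2$. From equation (53a) we have that

$$\Delta_{B^c}^1 = - \Big( \tfrac{1}{n} \langle X_{B^c}^1, X_{B^c}^1 \rangle \Big)^{-1} \Big[ \tfrac{1}{n} \langle X_{B^c}^1, X_B^1 \rangle \Delta_B^1 + \lambda_n z_{B^c}^1 \Big] + \Big( \tfrac{1}{n} \langle X_{B^c}^1, X_{B^c}^1 \rangle \Big)^{-1} \tfrac{1}{n} (X_{B^c}^1)^T w^1.$$

Substituting back into (53a),

$$\tfrac{1}{n} \langle X_B^1, X_B^1 \rangle \Delta_B^1 + \tfrac{1}{n} \langle X_B^1, X_{B^c}^1 \rangle \Delta_{B^c}^1 - \tfrac{1}{n} (X_B^1)^T w^1 + \lambda_n z_B^1 = 0,$$

so that we obtain

$$M^1 \Delta_B^1 = f^1 - \lambda_n z_B^1, \quad \text{and similarly,} \tag{58a}$$
$$M^2 \Delta_B^2 = f^2 - \lambda_n z_B^2. \tag{58b}$$

Recall that by assumption $S^1 \widehat{\beta}_B^1 = |\widehat{\beta}_B^1| = |\widehat{\beta}_B^2| = S^2 \widehat{\beta}_B^2$, and $S^1 z_B^1 + S^2 z_B^2 = \vec{1}$. Subtracting $M^1 S^1 \beta_B^1$ and $M^2 S^2 \beta_B^2$ from equations (58a) and (58b) yields

$$M^1 S^1 (\Delta_B^1 - \beta_B^1) = f^1 - \lambda_n S^1 z_B^1 - M^1 S^1 \beta_B^1, \tag{59a}$$
$$M^2 S^2 (\Delta_B^2 - \beta_B^2) = f^2 - \lambda_n S^2 z_B^2 - M^2 S^2 \beta_B^2. \tag{59b}$$

Applying the fact that $S^1 (\Delta_B^1 - \beta_B^1) = S^2 (\Delta_B^2 - \beta_B^2)$, we obtain

$$(M^1 + M^2) S^1 (\Delta_B^1 - \beta_B^1) = (f^1 + f^2) - \lambda_n \vec{1} - M^1 S^1 \beta_B^1 - M^2 S^2 \beta_B^2,$$

where $\vec{1} \in \mathbb{R}^{\alpha s}$. Then solving for $S^1 (\Delta_B^1 - \beta_B^1)$, letting $S^1 \beta_B^1 - S^2 \beta_B^2 = B_{\mathrm{diff}}$, and substituting back into equation (59a) yields

$$\lambda_n S^1 z_B^1 = M^1 [M^1 + M^2]^{-1} \lambda_n \vec{1} - \big[ (M^1)^{-1} + (M^2)^{-1} \big]^{-1} B_{\mathrm{diff}} + M^2 [M^1 + M^2]^{-1} f^1 - M^1 [M^1 + M^2]^{-1} f^2. \tag{60}$$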

D.5 Proof of Lemma 8

The first term $\frac{1}{\lambda_n} [(M^1)^{-1} + (M^2)^{-1}]^{-1} B_{\mathrm{diff}}$ can be decomposed as

$$\frac{1}{\lambda_n} \big[ (M^1)^{-1} + (M^2)^{-1} \big]^{-1} B_{\mathrm{diff}} = \underbrace{\frac{1}{\lambda_n} \Big( \big[ (M^1)^{-1} + (M^2)^{-1} \big]^{-1} - I/2 \Big) B_{\mathrm{diff}}}_{T_1} + \underbrace{\frac{B_{\mathrm{diff}}}{2 \lambda_n}}_{T_2}.$$

Under the assumptions of Theorem 3(a), we have $\|B_{\mathrm{diff}}\|_\infty / \lambda_n \to 0$; hence, for $s$ large enough, $\|T_2\|_\infty \le \epsilon/2$.

In order to bound $T_1$, we note that with probability greater than $1 - c_1 \exp(-c_2 n)$, the spectral norm of $([(M^1)^{-1} + (M^2)^{-1}]^{-1} - I/2)$ is $O(\sqrt{s/n})$ (see the bound (50b) from Appendix C). Consequently, we may decompose $([(M^1)^{-1} + (M^2)^{-1}]^{-1} - I/2)$ as $Q D Q^T$, where $Q$ and $D$ are independent, $Q$ is distributed uniformly over all orthogonal matrices, and $|\!|\!| D |\!|\!|_2 = O(\sqrt{s/n})$. Using this decomposition, the following lemma, proved in Appendix D.6, allows us to obtain the necessary control on the quantity $\|T_1\|_\infty$:

Lemma 9. Let Q S M sxs be a matrix chosen uniformly at random from the space of all 
orthogonal matrices. Consider a second random matrix A, independent of Q. If s/n = o(l), 
then for any fixed vector i£l s and fixed e > 0, we have: 



(a) If \\A\\ 2 < ^fl, then 

¥[\\Q T AQx\ 

(b) If\\A\\ 2 < S, then 

F[\\Q T AQx\\ 



e i 2 n 

> 77 < c l ex P ( - c 2f TI 
2 s\\x\\ 



11- 



S*\\X\ 



+ log(s)). 



+ log(s)). 



With reference to the problem of bounding $\|T_1\|_\infty$, we may apply part (a) of this lemma with $A = D$ and $x = B_{\mathrm{diff}} / \lambda_n$ to conclude that $\|T_1\|_\infty \le \epsilon/2$ with high probability, thereby establishing the bound (57a).



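We now turn to proving the bound (57b). We begin by decomposing the terms involved in this equation as

$$\frac{1}{\lambda_n} M^2 [M^1 + M^2]^{-1} f^1 = \frac{1}{\lambda_n} \Big( M^2 [M^1 + M^2]^{-1} - \frac{I}{2} \Big) f^1 + \frac{f^1}{2 \lambda_n},$$
$$\frac{1}{\lambda_n} M^1 [M^1 + M^2]^{-1} f^2 = \frac{1}{\lambda_n} \Big( M^1 [M^1 + M^2]^{-1} - \frac{I}{2} \Big) f^2 + \frac{f^2}{2 \lambda_n}.$$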


Recalling the form of p £ conditioned on X l Bc and w l , we have 

- 1 4 c )/ QS + ||^|| 2 /(n 2 A 2 )/ Q 



r/(2X n ) ~ N (0, hzhc, -{-{XU Xjsc 
\ 4 n n 

However, by Lemmas 1121 and 1131 (see Appendix [E]) , as well as the fact that \\zlW 2 . = (1 — a)s, 
for n and s large enough, the variance term is bounded by 

\{z BC , -(kxU X BC ))^z BC ) + lkili/(n 2 A 2 ) < hi - a)-(l + <5) + ^ ( 61 ) 
4 n n 4 n 2 nAd 



with probability greater than 1 — c\ exp(— c^n). Hence, by standard Gaussian tail bounds, the 
inequalities ||/ 1 /(2A n )|| 00 < e/4 and ||/ 2 /(2A n )|| 00 < e/4 both hold with probability greater 
than 1 — ci exp(— 5' log(p — 2s)). 

Now to bound the first term in the decomposition, we begin by diagonalizing $M^2 = Q^T D Q$. Note that $Q$ is independent of $X^1$ and $D$, and by symmetry $X_B \overset{d}{=} Q X_B$. Following some algebra, we find that

$$\frac{1}{\lambda_n} \Big( M^2 [M^1 + M^2]^{-1} - \frac{I}{2} \Big) f^1 = \frac{1}{2 \lambda_n} Q^T \Big[ 2 D \big[ Q M^1 Q^T + D \big]^{-1} - I \Big] Q f^1.$$

The random vector $f^1$ is independent of $Q$, and $Q f^1$ is independent of $Q$ by symmetry. Hence, the vector $v := \frac{1}{2 \lambda_n} [2 D (D + Q M^1 Q^T)^{-1} - I] Q f^1$ is independent of $Q$. For a given constant $c_3$, let us define the event

We can then write 

IP[||Q T ^l|oo > e] < P[||Q r H|oc >e\S] +F[S C \. 

Note that we may consider the event that \\D\l 2 = 0(1) and [2D(D + QM 1 Q T )~ 1 - I] = 
0(y/s/n). We claim that each of these events happens with high probability. Note that the 
former event occurs with high probability by Lemma [13] The latter event holds with high 
probability since, 

[2D(D + QM 1 Q T )- 1 -I} = [2D((D + QM 1 Q T )~ 1 - 1/2) + D - I}. 

and, both |||D-/|||2 = 0(y/sjn) and ((D + QM 1 Q T )~ 1 -1/2) = 0{^fsjn) by equation (IBHaT) . 
Thus, the sum of the two random matrices is also 0(y/s/n). 

Recall the bound on the variance of each component of f 1 from equation (|61|) and note 
that each component is independent. Applying the concentration results from Lemma [T2l for 
X-squared random variables yields that H/ 1 )!! < |(1 + <£)— + 5- It with high probability. 
Hence, under the above conditions 



l -[2D(D + QM l Q T r l -I}Q^f l f 2 < \\\]-[2D(D + QM l Q T )~ l - I\\\1\\Qy 



9 S 2 r S 1 

^ c i-[- + 



with high probabilty, which implies that S holds with high probability as well. Therefore, it 
immediately follows then that ¥[S C ] < c\ exp(— c 2 s). 

It remains to control the first term. We do so using the following lemma, which is proved 
in Appendix ID. 71 

Lemma 10. Let Q 6 jjmxm ^ e a ma i r { x chosen uniformly at random from the space of 
orthogonal matrices. Let v £ M. m be a random vector independent of Q, such that \\v\\ 2 < v* 
with probability one. Then we have 



F[||Q^|U>2,y^] = o(l). 

We now apply this lemma to the random vector v with m = s. and v* = C3 -f= + -r^- 
Note that 



= 2c 3 ,/!2Ii.± + -i- = (l), 
s V n \l n Xtn 



from which the second claim (57b) in Lemma 8 follows.

Finally, we turn to proving the third claim (57c) in Lemma 8. Following some algebra, we obtain

HAf 1 [Af 1 + Af 2 ]" 1 !- i||oo = ^(M 1 — M 2 )l + ^{M 1 — M 2 )(I/2 — (M 1 + M 2 )~ l )\. (62) 

We diagonalize the matrix $M^1 = Q^T D Q$, where $D$ is diagonal. Since the random matrix $M^1$ has a spherically symmetric distribution, the matrix $Q$ has a uniform distribution over the space of orthogonal matrices and is independent of $D$. Using this decomposition, we can rewrite the second term in equation (62) as

\Q T {D - QM 2 Q T ){ 1 - — (D + QM 2 Q T y 1 )Ql = Q T RQl (63) 

where R := \{D - QM 2 Q T )(I - 2{D + QM 2 Q T )~ 1 ). We note that R is independent of 
$Q$, because $D$ and $M^2$ are independent of $Q$. This independence follows from the spherical symmetry of $M^2$ and the fact that $M^2 \overset{d}{=} Q M^2 Q^T$.

Defining the event $\mathcal{T} := \{ |\!|\!| R |\!|\!|_2 \le 4s/n \}$, we claim that

$$P[\mathcal{T}^c] \le c_1 \exp(-c_2 n) \to 0. \tag{64}$$

In order to establish this claim, we note that sub-multiplicativity and triangle inequality imply 
that 

l\R\h < \\D - QM 2 Q T \\ 2 \\(D + QM 2 Q T )/2 - I\\ 2 \\2{D + QM 2 Q T )~ 1 \\ 2 
< 2(\\D - I\\ 2 + |||/ - QM 2 Q T \\ 2 )\\{D + QM 2 Q T )/2 - /||| 2 , 

since \2{Q T DQ + QM 2 Q T )~ 1 \\ 2 < 2 with probability greater than 1 — ci exp(— c 2 n), from 
the discussion following Lemma [T31 Similarly, from this same result, we have C(||-D — I\\ 2 ) = 
0(1 J - QM 2 Q T \\\ 2 ) = 0(\\\(D + QM 2 Q T )/2 - J|| 2 ) < 2y/%, so that the claim flM} follows. 
Using the decomposition ([63]) and the tail bound ([64^1 . we have 

P[||Q T i2Ql|| 0O > e] = P[||Q T i?Ql||oo > e | T] +P[T C ] 
< {- s )+0(eM-c(e)n)), 

where Lemma [9] (proved in Appendix lD.6p provides control on the first term in the inequality. 
D.6 Proof of Lemma 9

We provide the proof for part (a) of the lemma and note that part (b) is analogous. By the union bound, we have

$$P[\|Q^T A Q x\|_\infty \ge \epsilon] \le s \max_{i = 1, \ldots, s} P[|e_i^T Q^T A Q x| \ge \epsilon].$$

We will derive a bound on the probability $P[|e_i^T Q^T A Q x| \ge \epsilon]$ that holds for all $e_i$, $i = 1, \ldots, s$. We write $e_1^T Q^T A Q x = x_1 v_1^T A v_1 + v_1^T A v_2$, where $v_1$ denotes the first column of $Q$, and $v_2 = \sum_{k=2}^{s} x_k Q_k$ denotes the weighted sum of the remaining $(s - 1)$ columns of $Q$. Since $Q$ is orthogonal, the vector $v_1$ has unit norm $\|v_1\|_2 = 1$, the vector $v_2$ is orthogonal to $v_1$, and moreover $\|v_2\|_2 \le \|x\|_\infty \sqrt{s - 1}$. Owing to the bound on the spectral norm of $A$, we have

$$|x_1 v_1^T A v_1| \le \|x\|_\infty \sqrt{s/n},$$

which is less than $\epsilon/2$ for $(s, n)$ sufficiently large, since $s/n = o(1)$.

We now turn to the second term. Note that conditioned on V2, the vector v\ is uniformly 
distributed over an (s — l)-dimensional unit sphere, contained within the subspace orthogonal 
to V2- Still conditioning on v%, consider the function f{v\) = vfAv2- For any pair of vectors 
Vi, v[ on the unit sphere, we have 

l/(^i)-/Ki)| 2 = \(vi-v' 1 ) T Av 2 \ 2 

< Uf 2 \\x\\l(s - l)\\ Vl - v[g 

= UWlMKs-i) [2(1 -cos^,^)))], 

where d = arccos(w^^) is the geodesic distance. Using the inequality cos(d) > 1 — d 2 /2, valid 
for d G [0,7r], and the assumption |||j4|||2 < y/s/n, and taking square roots, we obtain 

so that / is a Lipschitz constant on the unit sphere (with dimension s — 1) with constant 
L = IM|oo\/ n ( s — Consequently, by Levy's theorem [12], for any e > 0, we have 

P[|/MI>e] < 2exp(-(s-2) n e 2 ) < 2 exp ( - Cl e 2 ) . 

As a final side remark, we note that under the scaling of Theorem[3^b), we have ^e 2 — log(s) 
as n — > oo, so that the probability in question vanishes. 



D.7 Proof of Lemma 10

By the union bound and symmetry of the distribution of $Q$, for any $t > 0$, we have

$$\mathbb{P}[\|Q^T v\|_\infty > t] \le m\,\mathbb{P}[|e_1^T Q^T v| > t] = m\,\mathbb{P}[|q_1^T v| > t],$$



where $q_1$ is the first column of $Q$. Note that $q_1$ is a random vector distributed uniformly over
the unit sphere $S^{m-1}$ in $m$ dimensions. Viewing the vector $v \in \mathbb{R}^m$ as fixed, consider the
function $f(q) = q^T v$ defined over $S^{m-1}$. As in Lemma 9, some calculation shows that the Lipschitz
constant of $f$ over $S^{m-1}$ is at most $L = \|v\|_2$. Applying Lévy's theorem [12], we conclude that
for any $t > 0$,

$$m\,\mathbb{P}[|f(q_1)| > t] \le 2\exp\Big(-(m-1)\frac{t^2}{2\|v\|_2^2} + \log m\Big).$$



Since $\|v\|_2 \le v^*$ by assumption, it suffices to set $t = 2v^*$



E Some large deviation bounds 

In this appendix, we state some known large deviation bounds for Gaussian variates,
$\chi^2$-variates, as well as the eigenvalues of random matrices. The following Gaussian tail bound
is standard:

Lemma 11. For a Gaussian variable $Z \sim N(0, \sigma^2)$, for all $t > 0$,

$$\mathbb{P}[|Z| \ge t] \le 2\exp\Big(-\frac{t^2}{2\sigma^2}\Big). \tag{65}$$
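As a numerical sanity check of the Gaussian tail bound (65), the following sketch compares the exact two-sided tail, computed from the complementary error function, with the right-hand side of the bound. The function names and the grid of test values are illustrative choices, not part of the original analysis.

```python
import math


def gaussian_two_sided_tail(t, sigma):
    """Exact P[|Z| >= t] for Z ~ N(0, sigma^2), via the complementary error function."""
    return math.erfc(t / (sigma * math.sqrt(2.0)))


def gaussian_tail_bound(t, sigma):
    """Right-hand side of (65): 2 * exp(-t^2 / (2 * sigma^2))."""
    return 2.0 * math.exp(-t * t / (2.0 * sigma * sigma))


def check_lemma_11(sigmas=(0.5, 1.0, 3.0), ts=(0.1, 0.5, 1.0, 2.0, 4.0)):
    """Return True if the exact tail never exceeds the bound on the (hypothetical) grid."""
    return all(
        gaussian_two_sided_tail(t, sigma) <= gaussian_tail_bound(t, sigma)
        for sigma in sigmas
        for t in ts
    )


if __name__ == "__main__":
    for t in (0.5, 1.0, 2.0):
        print(t, gaussian_two_sided_tail(t, 1.0), gaussian_tail_bound(t, 1.0))
    print("bound holds on grid:", check_lemma_11())
```

The comparison is deterministic, so it illustrates the inequality on a finite grid only and is not a proof.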
The following tail bounds on chi-squared variates are also useful: 

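Lemma 12. Let $X$ be a $\chi$-squared random variable with $d$ degrees of freedom. Then for all
$t > 0$, we have

$$\mathbb{P}\Big[\frac{X}{d} \ge (1+t)^2\Big] \le \exp\Big(-\frac{dt^2}{2}\Big), \quad \text{and} \tag{66a}$$

$$\mathbb{P}\Big[\frac{X}{d} \le (1-2t)\Big] \le \exp(-dt^2). \tag{66b}$$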

Proof. These tail bounds are immediate consequences of results due to Laurent and Massart,
who prove that for all $x > 0$, we have

$$\mathbb{P}[X \ge x + (\sqrt{x} + \sqrt{d})^2] \le \exp(-x), \quad \text{and} \tag{67a}$$

$$\mathbb{P}[X - d \le -2\sqrt{dx}] \le \exp(-x). \tag{67b}$$

Letting $x = dt^2/2$ in equation (67a), we have

$$\exp\Big(-\frac{dt^2}{2}\Big) \ge \mathbb{P}\Big[\frac{X}{d} \ge \sqrt{2}\,t + 1 + t^2\Big] \ge \mathbb{P}\Big[\frac{X}{d} \ge (1+t)^2\Big],$$

thereby establishing (66a). Similarly, setting $x = dt^2$ in equation (67b) yields the bound (66b)
immediately. □
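The two chi-squared bounds (66a) and (66b) can be checked numerically. The sketch below assumes an even number of degrees of freedom, for which the survival function has the closed form $\exp(-x/2)\sum_{k<d/2}(x/2)^k/k!$; this restriction, the function names, and the grid of values are illustrative assumptions only.

```python
import math


def chi2_sf_even(x, d):
    """P[X > x] for X chi-squared with an even number d of degrees of freedom."""
    if d <= 0 or d % 2 != 0:
        raise ValueError("this closed form requires even d > 0")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term, total = 1.0, 0.0
    for k in range(d // 2):
        total += term            # adds (x/2)^k / k!
        term *= half / (k + 1)   # next Poisson-type term
    return math.exp(-half) * total


def check_lemma_12(ds=(2, 10, 50), ts=(0.05, 0.1, 0.2, 0.4, 0.8)):
    """Return True if (66a) and (66b) hold for every (d, t) on the grid."""
    for d in ds:
        for t in ts:
            upper_tail = chi2_sf_even(d * (1.0 + t) ** 2, d)
            lower_tail = max(0.0, 1.0 - chi2_sf_even(d * (1.0 - 2.0 * t), d))
            if upper_tail > math.exp(-d * t * t / 2.0):   # bound (66a)
                return False
            if lower_tail > math.exp(-d * t * t):         # bound (66b)
                return False
    return True


if __name__ == "__main__":
    print("bounds hold on grid:", check_lemma_12())
```

For $t \ge 1/2$ the event in (66b) is empty, so the lower-tail probability is zero and the bound holds trivially.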

Finally, the following type of large deviations bound on the eigenvalues of Gaussian random 
matrices is standard (e.g., [6]): 

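Lemma 13. Let $X \in \mathbb{R}^{n \times s}$ be a random matrix from the standard Gaussian ensemble (i.e.,
$X_{ij} \sim N(0, 1)$, i.i.d.). Then with probability greater than $1 - c_1 \exp(-c_2 n)$, for any $\delta > 0$, its
eigenspectrum satisfies the bounds

$$(1-\delta)\Big(1 - \sqrt{\frac{s}{n}}\Big)^2 \le \Lambda_{\min}\Big(\frac{X^T X}{n}\Big) \le \Lambda_{\max}\Big(\frac{X^T X}{n}\Big) \le (1+\delta)\Big(1 + \sqrt{\frac{s}{n}}\Big)^2.$$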

Note that this lemma implies similar bounds for eigenvalues of the inverse:

$$\frac{1}{(1+\delta)\big(1 + \sqrt{s/n}\big)^2} \le \Lambda_{\min}\Big(\Big(\frac{X^T X}{n}\Big)^{-1}\Big) \le \Lambda_{\max}\Big(\Big(\frac{X^T X}{n}\Big)^{-1}\Big) \le \frac{1}{(1-\delta)\big(1 - \sqrt{s/n}\big)^2}.$$

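From the above two sets of inequalities, we conclude that for $s/n < 1$, with probability
greater than $1 - c_1 \exp(-c_2 n)$, we have

$$\Big|\Big|\Big|\frac{1}{n} X^T X - I\Big|\Big|\Big|_2 \le 4\sqrt{\frac{s}{n}}, \quad \text{and} \tag{68a}$$

$$\Big|\Big|\Big|\Big(\frac{1}{n} X^T X\Big)^{-1} - I\Big|\Big|\Big|_2 \le 4\sqrt{\frac{s}{n}}. \tag{68b}$$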
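A small simulation can illustrate the bound (68a) for a single random draw. The sketch below uses only the Python standard library: it forms $X^T X/n - I$ for a standard Gaussian $X$ and estimates its spectral norm by power iteration. The sizes $n = 2000$, $s = 5$, the seed, and all function names are hypothetical choices made for illustration. The power-iteration value never exceeds the true norm, so this is a plausibility check rather than a verification.

```python
import math
import random


def gram_minus_identity(n, s, rng):
    """Return M = X^T X / n - I for an n x s matrix X with i.i.d. N(0, 1) entries."""
    gram = [[0.0] * s for _ in range(s)]
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0) for _ in range(s)]
        for i in range(s):
            for j in range(i, s):
                gram[i][j] += row[i] * row[j]
    m = [[0.0] * s for _ in range(s)]
    for i in range(s):
        for j in range(i, s):
            value = gram[i][j] / n - (1.0 if i == j else 0.0)
            m[i][j] = value
            m[j][i] = value  # fill the symmetric entry
    return m


def spectral_norm_symmetric(m, iters=500):
    """Estimate the spectral norm of a symmetric matrix by power iteration."""
    s = len(m)
    v = [1.0 / math.sqrt(s)] * s
    norm = 0.0
    for _ in range(iters):
        w = [sum(m[i][j] * v[j] for j in range(s)) for i in range(s)]
        norm = math.sqrt(sum(x * x for x in w))  # ||M v|| for unit v
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    return norm


def check_68a(n=2000, s=5, seed=0):
    """Return (estimated norm, 4 * sqrt(s / n)) for one random draw."""
    rng = random.Random(seed)
    m = gram_minus_identity(n, s, rng)
    return spectral_norm_symmetric(m), 4.0 * math.sqrt(s / n)


if __name__ == "__main__":
    estimate, bound = check_68a()
    print(f"estimated norm = {estimate:.4f}, bound = {bound:.4f}")
```

Because the lemma is a high-probability statement, a single draw can in principle violate the bound; repeating the check over several seeds gives a rough empirical frequency.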

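For random matrices where each row is distributed $N(0, \Sigma)$ and $\Lambda_{\min}(\Sigma) \ge C_{\min}$ and
$\Lambda_{\max}(\Sigma) \le C_{\max}$, we have

$$\Big|\Big|\Big|\frac{1}{n} X^T X - \Sigma\Big|\Big|\Big|_2 \le \Lambda_{\max}(\Sigma)\, 4\sqrt{\frac{s}{n}}, \quad \text{and} \tag{69a}$$

$$\Big|\Big|\Big|\Big(\frac{1}{n} X^T X\Big)^{-1} - \Sigma^{-1}\Big|\Big|\Big|_2 \le \frac{4}{\Lambda_{\min}(\Sigma)}\sqrt{\frac{s}{n}}. \tag{69b}$$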